Size-estimation framework with
applications to transitive closure
and reachability
Edith Cohen
AT&T Bell Labs
1996
Presented by Maxim Kalaev
Agenda





Intro & Motivation
Algorithm sketch
The estimation framework
Estimating reachability
Estimating neighborhood sizes
Introduction
Descendant counting problem:
“Given a directed graph G, compute
for each node the number of nodes
reachable from it, and the total size
of the transitive closure”
Introduction

S(v): the set of nodes
reachable from node v
Transitive closure size:
$T = \sum_{v \in V} |S(v)|$
[Figure: example directed graph on nodes A, B, C, D, E]
Example:
|S(‘A’)| = 5, |S(‘B’)| = 3
T = |S(‘A’)| + |S(‘B’)| + … = 15
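For reference, the exact quantities defined above can be computed with one graph search per node, which is the costly baseline the estimation framework avoids. The sketch below is illustrative only: the edge set is hypothetical (the slide's figure is not recoverable), and the names `reachable` and `closure_size` are made up for this example. S(v) is taken to include v itself, as in the slide's example.

```python
from typing import Dict, List, Set


def reachable(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return S(start): all nodes reachable from start, including start."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w in graph[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def closure_size(graph: Dict[str, List[str]]) -> int:
    """Return T, the sum of |S(v)| over all nodes v."""
    return sum(len(reachable(graph, v)) for v in graph)


# Hypothetical edge set, not taken from the slides.
graph = {"A": ["B", "D", "E"], "B": ["C"], "C": [], "D": ["C"], "E": ["D"]}
sizes = {v: len(reachable(graph, v)) for v in graph}
total = closure_size(graph)
```

This baseline runs one search per node, so it takes O(|V|·(|V|+|E|)) time in the worst case.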
Motivation




Applicable to DB query size
estimation
Data mining
Matrix multiplication
optimizations
Parallel DFS algorithm
optimizations
Framework algorithm sketch

Least descendant mapping
Given a graph G(V,E) with ranks on
its nodes, compute a mapping from
each node v in V to the least-ranked
node in S(v)
[Figure: example graph with node ranks A:4, B:1, C:5, D:2, E:3]
Example:
• LE(‘A’) = 1
• LE(‘C’) = 2
Framework algorithm sketch


The rank of the LE (least element) is highly
correlated with the size of S(v): the larger S(v),
the smaller its minimum rank tends to be
Precision can be improved by
running several iterations, each with a
fresh random rank assignment and a
recomputed LE
The estimation framework



Let X be a set of elements x with nonnegative weights w(x).
Let Y be a set of labels y,
with a mapping $S: Y \to 2^X$ from labels y to
subsets of X
Our objective is to compute an estimate of:
$w(S(y)) = \sum_{x \in S(y)} w(x), \quad \forall y \in Y$
assuming X, Y and the weights are given, but
it is costly to calculate w(S(y)) exactly for all y
The estimation framework

Assume we have the following LE
(LeastElement) Oracle: given ranks R(x)
on the elements of X, LE(y) returns the element
le(y) with minimal rank in S(y) in O(1) time:
$R(le(y)) = \min_{x \in S(y)} R(x)$

The estimation algorithm performs k
iterations, where k is determined by the
required precision
The estimation framework

Iteration:

Independently, for each x in X select a
random rank R(x) from the exponential
distribution with parameter w(x)
The distribution function is:
$F_x(t) = 1 - e^{-w(x) t} \quad (t \ge 0)$
Apply LE to the selected ranking and store
the obtained min-ranks for each y in Y
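A single iteration as described above might look like the following sketch. The names `draw_ranks` and `least_element` and the toy sets are hypothetical. The LE oracle is replaced here by a brute-force minimum over S(y), which is not O(1); a real application supplies its own oracle. Elements of weight 0 are given rank infinity, an assumption made so that they never become the minimum.

```python
import random
from typing import Dict, Hashable, Iterable


def draw_ranks(weights: Dict[Hashable, float], rng: random.Random) -> Dict[Hashable, float]:
    """Draw R(x) ~ Exp(w(x)) independently for every element x."""
    # random.expovariate(lambd) samples an exponential with rate lambd (mean 1/lambd).
    return {
        x: (rng.expovariate(w) if w > 0 else float("inf"))
        for x, w in weights.items()
    }


def least_element(subset: Iterable[Hashable], ranks: Dict[Hashable, float]) -> Hashable:
    """Brute-force stand-in for the LE oracle: the element of minimal rank."""
    return min(subset, key=ranks.__getitem__)


rng = random.Random(0)
weights = {"a": 1.0, "b": 2.0, "c": 0.5}          # toy weights w(x)
S = {"y1": {"a", "b"}, "y2": {"a", "b", "c"}}     # toy mapping S(y)

ranks = draw_ranks(weights, rng)
min_ranks = {y: ranks[least_element(xs, ranks)] for y, xs in S.items()}
```

Because S(y1) is a subset of S(y2) in this toy input, the minimum rank stored for y2 can never exceed the one stored for y1.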
The estimation framework



Proposition: The distribution of the minimum
rank R(le(y)) depends only on w(S(y))
Proof: The minimum of k independent exponential r.v.’s
with parameters $w_1, \ldots, w_k$
is exponentially distributed with parameter $\sum_{j=1}^{k} w_j$
Our objective now is to estimate the
distribution parameter from the given
samples
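The fact used in the proof can be checked in one line via survival functions. Here $R_1, \ldots, R_k$ denote independent ranks with parameters $w_1, \ldots, w_k$ (notation introduced only for this derivation):

```latex
\begin{aligned}
P\Bigl\{\min_{1 \le j \le k} R_j > t\Bigr\}
  &= \prod_{j=1}^{k} P\{R_j > t\}
   = \prod_{j=1}^{k} e^{-w_j t}
   = e^{-\left(\sum_{j=1}^{k} w_j\right) t},
  \qquad t \ge 0 .
\end{aligned}
```

Applied to the elements of S(y), this says R(le(y)) is exponentially distributed with parameter w(S(y)).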
The estimation framework



The mean of an exponentially distributed r.v. with
parameter λ is 1/λ
We can use this fact to estimate λ from
samples by 1/(sample mean)
Use this to estimate w(S(y)) from the minimal
ranks obtained in k iterations:
$\tilde{w}(S(y)) = \frac{1}{\frac{1}{k}\sum_{i=1}^{k} R(le_i(y))} = \frac{k}{\sum_{i=1}^{k} R(le_i(y))}$
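A minimal sketch of this estimator on a toy set follows. The function name `estimate_weight`, the weights, the seed, and k = 2000 are all illustrative assumptions. The minimum rank is found by brute force instead of through an LE oracle.

```python
import random
from typing import List


def estimate_weight(min_ranks: List[float]) -> float:
    """Estimate w(S(y)) as k divided by the sum of the k observed minimum ranks."""
    return len(min_ranks) / sum(min_ranks)


rng = random.Random(1)
weights = [1.0, 2.0, 0.5, 3.5]   # toy weights of the elements of S(y); w(S(y)) = 7.0
k = 2000

# One minimum rank per iteration, each with freshly drawn exponential ranks.
samples = [min(rng.expovariate(w) for w in weights) for _ in range(k)]
estimate = estimate_weight(samples)
true_weight = sum(weights)
```

With k in the thousands, the relative error is typically a few percent. This is consistent with an expected relative error that shrinks like 1/√k.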
The estimation framework

More estimators:


Selecting the k(1-1/e)-th smallest of the
k samples (analogous to the median for a uniform
distribution)
Using this non-intuitive average
estimator:
$\tilde{w}(S(y)) = \frac{k-1}{\sum_{i=1}^{k} R(le_i(y))}$
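Both alternatives can be sketched as follows; the function names and the test parameters are hypothetical. The quantile version relies on the fact that an exponential distribution with parameter λ has F(1/λ) = 1 - 1/e, so the k(1-1/e)-th smallest sample approximates 1/λ. The (k-1)/sum version is, by a standard property of sums of exponentials, an unbiased estimate of λ for k ≥ 2.

```python
import math
import random
from typing import List


def average_estimate(samples: List[float]) -> float:
    """(k - 1) / sum of samples; requires at least two samples."""
    return (len(samples) - 1) / sum(samples)


def quantile_estimate(samples: List[float]) -> float:
    """Reciprocal of the k(1 - 1/e)-th smallest sample."""
    k = len(samples)
    index = max(0, math.ceil(k * (1 - 1 / math.e)) - 1)   # 0-based position
    return 1.0 / sorted(samples)[index]


rng = random.Random(2)
true_lambda = 10.0   # plays the role of w(S(y))
samples = [rng.expovariate(true_lambda) for _ in range(4000)]
avg_est = average_estimate(samples)
qnt_est = quantile_estimate(samples)
```

The quantile version only needs the order of the samples near one position, which makes it less sensitive to a few unusually large ranks.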
The estimation framework

Complexity so far:




Allowing a tolerated relative error ε, we need to
store $\log(1/\epsilon)$ significant bits for the R’s
k assignment iterations take O(k|X|) time
+ k*O(Oracle setup time)
Asymptotic accuracy bounds (the proof
comes later)
$\forall y \in Y: \quad P\{\, |w(S(y)) - \tilde{w}(S(y))| \ge \epsilon\, w(S(y)) \,\} = \exp(-\Omega(\epsilon^2 k))$
$E\left( \frac{|w(S(y)) - \tilde{w}(S(y))|}{w(S(y))} \right) = O(1/\sqrt{k})$
Estimating reachability

Objective: Given a graph G(V,E), for each v
estimate the number of its descendants, $|\tilde{S}(v)|$,
and the size of the transitive closure:
$\tilde{T}$ for $T = \sum_{v \in V} |S(v)|$

All we need is to implement an Oracle for
calculating the LE mapping.
The following algorithm takes an arbitrary
ranking of the nodes in sorted order and does
this in O(|E|) time:
Estimating reachability

LE subroutine()
Reverse the direction of the edges of the graph
Iterate until V = {}
  Pop v with minimal rank from V
  Run DFS to find all nodes reachable from
  v (call this set of nodes U)
  For each node u in U set LE(u) = v
  V = V \ U
  E = E \ {edges incident to nodes in U}
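The subroutine above can be sketched in Python as follows. The function name `le_mapping`, the edge set, and the dict-of-lists graph representation are assumptions. Every node must appear as a key. Removing U and its incident edges is emulated here by skipping nodes that already have an LE value, which visits each reversed edge at most once.

```python
from typing import Dict, Hashable, List


def le_mapping(graph: Dict[Hashable, List[Hashable]],
               rank: Dict[Hashable, float]) -> Dict[Hashable, Hashable]:
    """Map each node v to the least-ranked node in S(v)."""
    # Reverse the edge directions.
    rev: Dict[Hashable, List[Hashable]] = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            rev[v].append(u)

    le: Dict[Hashable, Hashable] = {}
    # Process nodes in increasing rank order ("pop v with minimal rank").
    for src in sorted(graph, key=rank.__getitem__):
        if src in le:
            continue                      # already removed with an earlier U
        le[src] = src
        stack = [src]
        while stack:                      # DFS in the reversed graph
            u = stack.pop()
            for w in rev[u]:
                if w not in le:           # skip nodes removed earlier
                    le[w] = src
                    stack.append(w)
    return le


# Hypothetical edge set; the ranks follow the labels shown on the slides.
graph = {"A": ["B", "D", "E"], "B": ["C"], "C": [], "D": ["C"], "E": ["D"]}
rank = {"A": 4, "B": 1, "C": 5, "D": 2, "E": 3}
le = le_mapping(graph, rank)
```

The sketch sorts with `sorted`, so it costs O(|V| log |V|) for the ordering; the O(|E|) bound on the slide assumes the ranking is already given in sorted order.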

Estimating reachability


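Each estimation iteration takes O(|V|) +
O(|E|) time, assuming we can sort the node ranks
in expected linear time.
Accuracy bounds (from the estimator bounds)
$\forall v \in V: \quad P\{\, \bigl| |S(v)| - |\tilde{S}(v)| \bigr| \ge \epsilon\, |S(v)| \,\} = \exp(-\Omega(\epsilon^2 k))$
$E\left( \frac{\bigl| |S(v)| - |\tilde{S}(v)| \bigr|}{|S(v)|} \right) = O(1/\sqrt{k})$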
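Putting the pieces together for reachability, a compact end-to-end sketch could look like this. It uses unit weights, so ranks are drawn from Exp(1), and it applies the k/sum estimator. The function names, the edge set, the seed, and k = 2000 are illustrative assumptions, and the nodes are ordered with a comparison sort instead of an expected-linear-time sort.

```python
import random
from typing import Dict, Hashable, List


def min_ranks(graph: Dict[Hashable, List[Hashable]],
              rank: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """For every node v, return the minimum rank over S(v)."""
    rev: Dict[Hashable, List[Hashable]] = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            rev[v].append(u)
    best: Dict[Hashable, float] = {}
    for src in sorted(graph, key=rank.__getitem__):
        if src in best:
            continue
        best[src] = rank[src]
        stack = [src]
        while stack:
            u = stack.pop()
            for w in rev[u]:
                if w not in best:
                    best[w] = rank[src]
                    stack.append(w)
    return best


def estimate_reachability(graph: Dict[Hashable, List[Hashable]], k: int,
                          rng: random.Random) -> Dict[Hashable, float]:
    """Estimate |S(v)| for every v from k rounds of random ranks."""
    totals = {v: 0.0 for v in graph}
    for _ in range(k):
        rank = {v: rng.expovariate(1.0) for v in graph}   # unit weights
        for v, r in min_ranks(graph, rank).items():
            totals[v] += r
    return {v: k / totals[v] for v in graph}


# Hypothetical edge set, not taken from the slides.
graph = {"A": ["B", "D", "E"], "B": ["C"], "C": [], "D": ["C"], "E": ["D"]}
est = estimate_reachability(graph, k=2000, rng=random.Random(3))
t_est = sum(est.values())   # estimate of the transitive closure size T
```

For this toy graph the exact values are |S(A)| = 5, |S(B)| = 2, |S(C)| = 1, |S(D)| = 2, |S(E)| = 3 and T = 13, so the estimates can be compared directly.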
Estimating neighborhood sizes


Problem: Given a graph G(V,E) with
nonnegative edge lengths, estimate the
number of nodes
within distance at most d from node v,
denoted n(v,d)
Our algorithm preprocesses G in
$O(|E| \log|V| + |V| \log^2|V|)$ time and after that
can answer (v,d) queries in
$O(\log\log|V|)$ time
Estimating neighborhood sizes
[Figure: example graph with node ranks A:4, B:1, C:5, D:2, E:3 and edge lengths 1, 3, 2, 1, 4, 1]
N(A,7) = {A,B,C,D,E}, n(A,7) = 5
N(A,3) = {A,C,E}, n(A,3) = 3
N(D,0) = {D}, n(D,0) = 1
N(C,∞) = {C}, n(C,∞) = 1
Estimating neighborhood sizes


After preprocessing G we generate
for each node v a list of pairs: ({d1,s1},
{d2,s2},…,{dη,sη}), where the d’s stand for
distances and the s’s stand for estimated
neighborhood sizes. The lists are
sorted by the d’s.
To obtain n(v,d) we look for the pair i such
that $d_i \le d < d_{i+1}$ and return $s_i$
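The lookup described above is a binary search over the sorted distances. In the sketch below, the function name `neighborhood_estimate`, the parallel-list layout, and the sample numbers are hypothetical. Returning 0.0 when d is smaller than the first breakpoint is an assumption about how such queries should be handled.

```python
from bisect import bisect_right
from typing import List


def neighborhood_estimate(distances: List[float], sizes: List[float], d: float) -> float:
    """Return s_i for the index i with d_i <= d < d_(i+1)."""
    i = bisect_right(distances, d) - 1   # last index whose distance is <= d
    if i < 0:
        return 0.0                       # d lies before the first breakpoint
    return sizes[i]


# Hypothetical precomputed list for one node, sorted by distance.
distances = [0, 1, 2, 4]
sizes = [1.0, 2.1, 2.9, 5.2]

answer = neighborhood_estimate(distances, sizes, 3)
```

With a list of length η = O(log|V|), the binary search takes O(log η) = O(loglog|V|) steps, which matches the query time quoted earlier.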
Estimating neighborhood sizes

The algorithm runs k iterations. In each
iteration it creates for each node in G a
least-element list ({d1,v1},
{d2,v2},…,{dη,vη}) such that for any
neighborhood (v,d) we can find
the min-rank node using the list: for $d_i \le d < d_{i+1}$
the min-rank node is $v_i$
Estimating neighborhood sizes
[Figure: example graph with node ranks A:4, B:1, C:5, D:2, E:3 and edge lengths 1, 3, 2, 1, 4, 1]
Neighborhoods:
N(A,7)={A,B,C,D,E}
N(A,3)={A,C,E}
N(D,1)={C,D}
N(C,∞)={C}
LE-lists (pairs written as {node, distance}):
A: ({A,0}{E,1}{D,2}{B,4})
B: ({B,0})
C: ({C,0})
D: ({D,0})
E: ({E,0}{D,3})
Estimating neighborhood sizes - alg

sub Make_le_lists()
Assume nodes v1..vn are sorted by rank
in increasing order
Reverse the edge directions of G
For i=1..n: $d_i \leftarrow \infty$, $v_i$’s list $\leftarrow$ (empty list)
For i=1..n (modified Dijkstra’s alg.)
DO:
(next slide)
Estimating neighborhood sizes - alg
Start with an empty heap, place $v_i$ on the
heap with label 0
Iterate until the heap is empty:
I. Pop the node $v_k$ with minimal label d from
the heap
II. Add the pair $(d, v_i)$ to $v_k$’s LE-list,
set $d_k \leftarrow d$
For each out-edge $e_j = (v_k, v_j)$ of $v_k$:
  If $v_j$ is in the heap, update its label to
  MIN(its current label, $d + D(e_j)$)
  Else: if $d + D(e_j) < d_j$ place $v_j$ on the heap
  with label $d + D(e_j)$
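A runnable sketch of Make_le_lists follows, with two substitutions that should be read as assumptions. Python's `heapq` has no update operation, so label updates are emulated by pushing duplicate entries and discarding stale ones when popped. The edge set is also not taken from the figure (which is not recoverable); it was chosen to be consistent with the LE-lists printed on the slides. The function name and the graph layout are hypothetical.

```python
import heapq
from typing import Dict, Hashable, List, Tuple

Edges = Dict[Hashable, List[Tuple[Hashable, float]]]


def make_le_lists(edges: Edges, rank: Dict[Hashable, float]
                  ) -> Dict[Hashable, List[Tuple[float, Hashable]]]:
    """Build, for every node, its LE-list of (distance, node) pairs sorted by distance."""
    # Reverse the edge directions of G.
    rev: Edges = {v: [] for v in rank}
    for u, outs in edges.items():
        for v, length in outs:
            rev[v].append((u, length))

    best = {v: float("inf") for v in rank}          # the values d_j
    le_lists: Dict[Hashable, List[Tuple[float, Hashable]]] = {v: [] for v in rank}

    for src in sorted(rank, key=rank.__getitem__):  # increasing rank
        heap = [(0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d >= best[u]:
                continue                            # stale or non-improving entry
            le_lists[u].append((d, src))
            best[u] = d
            for w, length in rev[u]:
                if d + length < best[w]:
                    heapq.heappush(heap, (d + length, w))

    # Entries were appended in decreasing distance; store them sorted by distance.
    return {v: lst[::-1] for v, lst in le_lists.items()}


# Assumed edge set (node -> list of (successor, length)).
edges = {"A": [("E", 1), ("D", 2), ("B", 4)], "B": [("C", 1)],
         "C": [], "D": [("C", 1)], "E": [("D", 3)]}
rank = {"A": 4, "B": 1, "C": 5, "D": 2, "E": 3}
le_lists = make_le_lists(edges, rank)
```

With a binary heap and duplicate entries, each push and pop costs O(log) of the heap size, so this sketch does not reproduce the Fibonacci-heap bound exactly.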
Estimating neighborhood sizes - demo
[Figure: step-by-step run of Make_le_lists on the example graph (node ranks A:4, B:1, C:5, D:2, E:3)]
Resulting LE-lists:
A: A:0 E:1 D:2 B:4
B: B:0
C: C:0
D: D:0
E: E:0 D:3
Estimating neighborhood sizes analysis

Correctness
Proposition 1:


A node v is placed on the heap in iteration i
if and only if $dist(v_i, v) < dist(v_j, v) \;\; \forall j < i$
If v is placed on the heap in iteration i,
then the pair $\{dist(v_i, v), v_i\}$ is placed on
v’s list and v’s value d is updated to be
$dist(v_i, v)$
Estimating neighborhood sizes analysis

Complexity
Proposition 2:

If the ranking is a random permutation,
the expected size of each LE-list is
O(log|V|)
The proof is based on Proposition 1 and a
divide & conquer style analysis
Estimating neighborhood sizes analysis
(proof cont)
Assume the LE-list of node u contains x pairs. Consider
the nodes sorted by their distance to node u:
v1,v2,….
According to Proposition 1, node u enters the heap
at iteration i iff all the nodes with lower ranks are
farther from u than $v_i$ is. Random ranks are
expected to partition the sequence v1,v2,… such that
the node of rank i is nearer to u than about half of the nodes
with ranks > i.
It follows that the expected value of x is O(log|V|)
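One way to make this argument precise is a direct expectation calculation. It assumes the distances from u are distinct, a simplification added here. Let $v_1, \ldots, v_n$ be the nodes sorted by distance from u, with n = |V|. By Proposition 1, $v_i$ appears on u's LE-list exactly when it has the smallest rank among $v_1, \ldots, v_i$. Under a random permutation this happens with probability 1/i:

```latex
\begin{aligned}
E[x] \;=\; \sum_{i=1}^{n} P\{\, v_i \text{ has the least rank among } v_1,\ldots,v_i \,\}
      \;=\; \sum_{i=1}^{n} \frac{1}{i}
      \;=\; H_n \;\le\; 1 + \ln n \;=\; O(\log |V|).
\end{aligned}
```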
Estimating neighborhood sizes analysis

Complexity (cont)
Running time: Using Fibonacci heaps we
have an O(log|V|) pop() operation and O(1)
insert() or update(). Let $l_i$ be the number
of iterations in which $v_i$ was placed on
the heap (0<i≤|V|). It follows that the
running time is: $O\left( \sum_{1 \le i \le |V|} l_i \,(\log|V| + \mathrm{outdeg}(v_i)) \right)$
As $l_i$ is also the size of $v_i$’s LE-list, whose expected size is O(log|V|), we get:
$O(|V| \log^2|V| + |E| \log|V|)$
Estimating neighborhood sizes
k iterations: issues
What to do with the k LE-lists obtained per
node? The naïve way gives a query time of
O(k*loglog|V|).
It can be improved to O(log k +
loglog|V|) by merging the lists and
storing the sum of ranks for each breakpoint.
Total algorithm setup time is:
$O(k(|V| \log^2|V| + |E| \log|V|))$

Summary


General size-estimation framework
Two applications – transitive closure
size estimation and neighborhoods
size estimation
THE END!