
Locality-Sensitive Distributed Computing
David Peleg
Weizmann Institute
Structure of mini-course
1. Basics of distributed network algorithms
2. Locality-preserving network
representations
3. Constructions and applications
Part 1: Basic distributed algorithms
• Model
• Broadcast
• Tree constructions
• Synchronizers
• Coloring, MIS
The distributed network model
Point-to-point communication network
The distributed network model
Described by undirected weighted
graph G(V,E,w)
V={v1,…,vn} - Processors (network sites)
E - bidirectional communication links
The distributed network model
w: E → R+ edge weight function
representing transmission costs
(usually satisfies triangle inequality)
Unique processor ID's: ID: V → S
S = {s1,s2,…} ordered set of integers
Communication
Processor v has deg(v,G) ports
(external connection points)
Edge e represents pair ((u,i),(v,j))
= link connecting u's port i to v's port j
Communication
Message transmission from u to neighbor v:
• u loads M onto port i
• v receives M in input buffer of port j
Communication
Assumption:
At most one message can occupy a
communication link at any given time
(Link is available for next transmission only
after previous message is removed from input
buffer by receiving processor)
Allowable message size = O(log n) bits
(messages carry a fixed number of vertex ID's,
e.g., sender and destination)
Issues unique to distributed computing
There are several inherent differences
between the distributed
and the traditional centralized-sequential
computational models
Communication
In centralized setting: Issue nonexistent
In distributed setting: Communication
• has its limits (in speed and capacity)
• does not come "for free"
⇒ should be treated as a computational resource
such as time or memory
(often - the dominating consideration)
Communication as a scarce resource
One common model: LOCAL
Assumes local processing comes for free
(Algorithm pays only for communication)
Incomplete knowledge
In centralized-sequential setting:
Processor knows everything (inputs,
intermediate results, etc.)
In distributed setting:
Processors have very partial picture
Partial topological knowledge
Model of anonymous networks:
Identical nodes
no ID's
no topology knowledge
Intermediate models:
Estimates for network diameter, # nodes etc
unique identifiers
neighbor knowledge
Partial topological knowledge (cont)
Permissive models:
Topological knowledge of large regions, or
even entire network
Structured models:
Known sub-structure, e.g., spanning tree /
subgraph / hierarchical partition / routing
service available
Other knowledge deficiencies
• know only local portion of the input
• do not know who else participates
• do not know current stage of other
participants
Coping with failures
In centralized setting: straightforward.
Upon abnormal termination or system crash:
locate source of failure, fix it and go on.
In distributed setting: complications.
When one component fails, others continue running.
Ambitious goal: ensure protocol runs correctly
despite occasional failures at some machines
(including “confusion-causing failures”, e.g., failed
processors sending corrupted messages)
Timing and synchrony
Fully synchronous network:
• All link delays are bounded
• Each processor keeps local clock
• Local pulses satisfy following property:
Message sent from v to neighbor u
at pulse p of v
arrives at u before its pulse p+1
Think of entire system as driven by
global clock
Timing and synchrony
Machine cycle of processors composed of 3 steps:
1. Send msgs to (some) neighbors
2. Wait to receive msgs from neighbors
3. Perform some local computation
Asynchronous model
Algorithms are event-driven :
• No access to global clock
• Messages sent from processor to neighbor
arrive within finite but unpredictable time
Asynchronous model
⇒ Can't tell whether a message is coming or not:
perhaps "the message is still on its way"
Impossible to rely on ordering of events
(might reverse due to different message
transmission speeds)
Nondeterminism
Asynchronous computations are inherently
nondeterministic
(even when protocols do not use
randomization)
Nondeterminism
Reason:
Message arrival order may differ from one
execution to another (e.g., due to other events
concurrently occurring in the system –
queues, failures)
⇒ Running the same algorithm twice on the same
inputs may yield different outputs / "scenarios"
Complexity measures
• Traditional (time, memory)
• New (messages, communication)
Time
For synchronous algorithm P:
Time(P) = (worst case) # pulses during
execution
For asynchronous algorithm P ?
(Even a single message can incur arbitrary
delay ! )
Time
For asynchronous algorithm P:
Time(P) = (worst-case) # time units from start
to end of execution,
assuming each message incurs delay ≤ 1 time
unit (*)
Time
Note:
1. Assumption (*) is used only for performance
evaluation, not for correctness.
2. (*) does not restrict set of possible scenarios
– any execution can be “normalized” to fit
this constraint
3. “Worst-case” means all possible inputs and
all possible scenarios over each input
Memory
Mem(P) = (worst-case) # memory bits used
throughout the network
MaxMem(P) = maximum local memory
Message complexity
Basic message = O(log n) bits
Longer messages cost proportionally to length
Sending basic message over edge costs 1
Message(P) = (worst case) # basic messages
sent during execution
Distance definitions
Length of path (e1,...,es) = s
dist(u,w,G) = length of shortest u - w path in G
Diameter:
Diam(G) = max_{u,v∈V} {dist(u,v,G)}
Distance definitions (cont)
Radius:
Rad(v,G) = max_{w∈V} {dist(v,w,G)}
Rad(G) = min_{v∈V} {Rad(v,G)}
A center of G:
vertex v s.t. Rad(v,G) = Rad(G)
Observe: Rad(G) ≤ Diam(G) ≤ 2·Rad(G)
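These definitions can be sanity-checked directly; below is a minimal Python sketch (the adjacency-list representation and helper names are illustrative, not part of the course material), treating G as unweighted so that dist counts hops:

```python
from collections import deque

def dist(u, w, adj):
    """Length of a shortest u-w path, via BFS (unweighted graph)."""
    depth = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == w:
            return depth[x]
        for y in adj[x]:
            if y not in depth:
                depth[y] = depth[x] + 1
                queue.append(y)
    return float("inf")

def rad(v, adj):
    """Rad(v,G) = max over w of dist(v,w,G)."""
    return max(dist(v, w, adj) for w in adj)

def radius(adj):
    """Rad(G) = min over v of Rad(v,G)."""
    return min(rad(v, adj) for v in adj)

def diameter(adj):
    """Diam(G) = max over u,v of dist(u,v,G)."""
    return max(rad(v, adj) for v in adj)

# 5-node path 0-1-2-3-4: center is node 2
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
assert radius(path) == 2 and diameter(path) == 4
assert radius(path) <= diameter(path) <= 2 * radius(path)
```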
Broadcast
Goal:
Disseminate message M originated at
source r0 to all vertices in network
Basic lower bounds
Thm:
For every broadcast algorithm B:
• Message(B) ≥ n-1,
• Time(B) ≥ Rad(r0,G) = Ω(Diam(G))
Tree broadcast
Algorithm Tcast(r0,T)
• Use spanning tree T of G rooted at r0
• Root broadcasts M to all its children
• Each node v getting M, forwards it to children
Tree broadcast (cont)
Assume: Spanning tree known to all nodes
(Q: what does it mean in distributed
context?)
Tree broadcast (cont)
Claim: For spanning tree T rooted at r0:
• Message(Tcast) = n-1
• Time(Tcast) = Depth(T)
Tcast on BFS tree
BFS (Breadth-First Search) tree =
Shortest-paths tree:
The level of each v in T is dist(r0,v,G)
Tcast (cont)
Corollary:
For BFS tree T w.r.t. r0:
• Message(Tcast) = n-1
• Time(Tcast) ≤ Diam(G)
(Optimal in both)
But what if there is no spanning tree ?
The flooding algorithm
Algorithm Flood(r0)
1. Source sends M on each outgoing link
2. For other vertex v:
• On receiving M first time over edge e:
store in buffer; forward on every edge ≠ e
• On receiving M again (over other edges):
discard it and do nothing
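A round-based (synchronous) simulation of Flood can illustrate the claims that follow; this is a sketch under the assumption of an unweighted adjacency-list graph, with the function name and return values chosen here for illustration:

```python
def flood(adj, r0):
    """Synchronous round-based simulation of Algorithm Flood from r0.
    Returns (parent pointers of the implicit tree, #rounds, #messages)."""
    parent = {r0: None}          # r0 holds M initially
    frontier = [(r0, None)]      # (node, neighbor it received M from)
    rounds = messages = 0
    while frontier:
        nxt = []
        for v, from_v in frontier:
            for w in adj[v]:
                if w == from_v:
                    continue     # forward on every edge except e
                messages += 1
                if w not in parent:   # first receipt: store & forward
                    parent[w] = v
                    nxt.append((w, v))
                # later receipts are simply discarded
        frontier = nxt
        if nxt:
            rounds += 1
    return parent, rounds, messages

# 4-cycle: M reaches all nodes in Rad(r0) = 2 rounds
cyc = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
parent, rounds, msgs = flood(cyc, 0)
assert set(parent) == {0, 1, 2, 3} and rounds == 2
assert msgs <= 2 * 4             # each edge carries <= 2 messages
```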
Flooding - correctness
Lemma:
1. Alg. Flood yields correct broadcast
2. Time(Flood) = Θ(Rad(r0,G)) = Θ(Diam(G))
3. Message(Flood) = Θ(|E|)
in both synchronous and asynchronous models
Proof:
Message complexity: each edge delivers M at
most once in each direction
Neighborhoods
Γl(v) = l-neighborhood of v
= vertices at distance l or less from v
(Γ0(v) ⊆ Γ1(v) ⊆ Γ2(v) ⊆ …)
Time complexity
Verify (by induction on t) that:
After t time units, M has already reached
every vertex at distance ≤ t from r0
(= every vertex in the t-neighborhood Γt(r0))
Note: In asynchronous model, M may have
reached additional vertices
(messages may travel faster)
Time complexity
Note: Algorithm Flood implicitly constructs
directed spanning tree T rooted at r0,
defined as follows:
The parent of each v in T
is the node from which v received M
for the first time
Lemma: In the synchronous model,
T is a BFS tree w.r.t. r0, with depth Rad(r0,G)
Flood time
Note: In the asynchronous model, T may be
deeper (depth up to n-1)
Note: Time is still O(Diam(G)) even in this case!
Broadcast with echo
Goal: Verify successful completion of broadcast
Method: Collect acknowledgements on a
spanning tree T
Broadcast with echo
Converge(Ack) process - code for v
Upon getting M do:
• For v leaf in T:
- Send up an Ack message to parent
• For v non-leaf:
- Collect Ack messages from all children
- Send Ack message to parent
Collecting Ack’s
Semantics of Ack from v:
"joint ack" for entire subtree Tv rooted at v,
signifying that each vertex in Tv received M
⇒ r0 receives Ack from all children
only after all vertices received M
Claim: On tree T,
• Message(Converge(Ack)) = O(n)
• Time(Converge(Ack))=O(Depth(T))
Tree selection
Tree broadcast alg:
Take same tree used for broadcast.
Time / message complexities grow
by const factor.
Flooding alg:
Use tree T defined by broadcast
Synch. model:
BFS tree - complexities double
Asynch. model: no guarantee
Tree selection - complexity
Lemma: In network G(V,E) of diameter D,
complexities of “broadcast with echo” are:
• Message(FloodEcho)=O(|E|)
• Time(FloodEcho)=
O(D) in synchronous model,
O(n) in asynchronous model.
• In both models, M reaches all by time D
BFS tree constructions
In synchronous model:
Algorithm Flood generates a BFS tree, with
optimal complexity:
• Message(Flood) = Θ(|E|)
• Time(Flood) = Θ(Diam(G))
In asynchronous model:
Tree generated by Algorithm Flood is not BFS
Level-synchronized BFS construction (Dijkstra)
Idea:
• Develop BFS tree from root r0 in phases,
level by level
• Build next level by adding all vertices
adjacent to nodes in lowest tree level
After p phases: constructed partial tree Tp
• The tree Tp is a BFS tree for Γp(r0)
• Each v in Tp knows its parent, children, depth
Level-synchronized BFS (Dijkstra)
Phase p+1:
1. r0 broadcasts message Pulse on Tp
2. Each leaf of Tp sends “exploration” message
Layer to all neighbors except parent.
Level-synchronized BFS (Dijkstra)
3. Vertex w receiving Layer message for the
first time (possibly from many neighbors)
picks one neighbor v, lists it as parent,
sends back Ack messages to all Layer
messages
Vertex w in Tp receiving
Layer message sends back
Ack messages to all Layer
messages
Level-synchronized BFS (Dijkstra)
4. Each leaf v collects acks on exploration msgs.
If w chose v as parent, v lists w as child
5. Once receiving Ack on all Layer messages,
leaf v sends Ack to its parent.
Acks are convergecast on Tp
back to r0.
6. Once convergecast
terminates, r0 starts
next phase
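The phase structure can be mimicked by a centralized sketch that grows the tree one layer per phase (this abstracts away the Pulse/Layer/Ack exchanges and keeps only the layer-by-layer growth; names are illustrative):

```python
def dijkstra_bfs(adj, r0):
    """Centralized sketch of the level-synchronized BFS construction:
    in each phase, current leaves explore their neighbors, and every
    newly reached vertex picks one explorer as its parent."""
    parent = {r0: None}
    leaves = [r0]
    phases = 0
    while leaves:
        phases += 1
        next_layer = []
        for v in leaves:                 # leaves send Layer messages
            for w in adj[v]:
                if w not in parent:      # w picks one neighbor as parent
                    parent[w] = v
                    next_layer.append(w)
        leaves = next_layer
    return parent, phases

cyc = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
parent, phases = dijkstra_bfs(cyc, 0)

# depth of each node in the tree equals its distance from r0
def depth(v):
    return 0 if parent[v] is None else 1 + depth(parent[v])

assert [depth(v) for v in sorted(parent)] == [0, 1, 2, 1]
```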
Analysis
Correctness:
By induction on p, show:
• After phase p,
variables parent and child
define legal BFS tree
spanning r0's p-neighborhood
⇒ Algorithm constructs BFS tree rooted at r0.
Analysis (cont)
Time complexity:
Time(Phase p) = 2p+2
Time = Σp (2p+2) = O(Diam²(G))
Analysis (cont)
Message complexity:
For integer p ≥ 0 let
Vp = vertices in layer p
Ep= edges internal to Vp
Ep,p+1 = edges between
Vp and Vp+1
Analysis (cont)
Phase p:
Layer msgs of phase p sent only on Ep and Ep,p+1
⇒ Only O(1) messages sent over each edge
• Tp edges are traversed twice (≤ 2n messages)
Analysis (cont)
Comm(Phase p) = O(n + |Ep| + |Ep,p+1|)
In total: Comm = Σp O(n + |Ep| + |Ep,p+1|)
= O(n·Diam(G) + |E|)
Complexities of BFS algorithms

Reference                     Messages         Time
Lower bound (+ sync. model)   |E|              D
Dijkstra                      |E| + n·D        D²
Bellman-Ford                  n·|E|            D
Best known                    |E| + n·log³n    D·log³n

(D = Diam(G))
Synchronizers
Goal: Transform algorithm for synchronous
networks into algorithm
for asynchronous networks.
Motivation:
Algorithms for the synchronous model are easier
to design / debug / test than ones for the
asynchronous model
(behavior of asynchronous systems is harder to analyze)
Synchronizers
Synchronizer: methodology for such simulation:
Given algorithm S for synchronous network,
and synchronizer σ,
combine them to yield protocol A = σ(S)
executable on asynchronous network
Correctness requirement:
A's execution on asynchronous network “similar” to S's execution on synchronous one
Underlying simulation principles
Combined protocol A composed of two parts:
• original component
• synchronization component
(each with its own local var's and msg types)
Pulse generator: Processor v has pulse var Pv,
generating sequence of local clock pulses,
i.e., periodically increasing Pv=0,1,2,...
Underlying simulation principles
Under protocol A,
each v performs during time interval when Pv=p
precisely the actions it should perform during
round p of the synchronous algorithm S
Def: t(v,p) = global time when v increased its
pulse to p.
We say that "v is at pulse Pv=p"
during the time interval
τ(v,p) = [t(v,p), t(v,p+1))
Underlying simulation principles
Pulse compatibility:
If processor v sends original message M
to neighbor w during its pulse Pv=p
then w receives M during its pulse Pw=p
Correct simulations
Synchronous protocol S
Simulating protocol A = σ(S)
Execution ζS = ζS(G,I) of S in synch. network
Execution ζA = ζA(G,I) of A in asynch. network
(same topology G, same input I)
Correct simulations (cont)
Similar executions:
Executions ζA and ζS are similar if
for every v,
for every neighbor w,
for every original local variable X at v,
for every integer p ≥ 0:
1. X value at beginning of pulse p in ζA
= X value at beginning of round p in ζS
Correct simulations (cont)
2. Original messages sent by v to w
during pulse p in execution ζA same as those
sent by v to w during round p in execution ζS
3. Original messages received by v from w
during pulse p in ζA same as those received
by v from w during round p in ζS
4. Final output of v in ζA same as in ζS
Correct simulations (cont)
Correct simulation:
Asynchronous protocol A simulates
synchronous protocol S if
for every network topology and initial input,
the executions of A and S are similar
Synchronizer σ is correct if
for every synchronous protocol S,
protocol A = σ(S) simulates S
Correct simulations (cont)
Lemma:
If synchronizer σ guarantees pulse compatibility
then it is correct
Goal: Impose pulse compatibility
Correct simulations (cont)
Fundamental question:
When is it permissible for a processor
to increase its pulse number?
Correct simulations (cont)
First answer:
Increase pulse from p to p+1
once certain that original messages of
algorithm S sent by neighbors during their
pulse p will no more arrive
Question:
How can that be ensured?
Correct simulations (cont)
Readiness property:
Processor v is ready for pulse p,
denoted Ready(v,p),
once it already received all algorithm messages
sent to it by neighbors during their pulse p-1.
Readiness rule:
Processor v may generate pulse p once it
finished its original actions for pulse p-1,
and Ready(v,p) holds.
Correct simulations (cont)
Problem: Obeying the readiness rule
does not impose pulse compatibility
(Bad scenario:
v is ready for pulse p, generates pulse p,
sends msg of pulse p to neighbor w,
yet w is still “stuck” at pulse p-1,
waiting for msgs of pulse p-1
from some other neighbor z)
Correct simulations (cont)
Fix: Delay messages that arrived too early
Delay rule:
On receiving, while at pulse p-1, a msg sent by
w at its pulse p: temporarily store it;
process it only after generating pulse p
Correct simulations (cont)
Lemma: A synchronizer imposing both
readiness and delay rules
guarantees pulse compatibility
Corollary: If synchronizer σ imposes the
readiness and delay rules,
then it is correct
Implementation phases
Problem: To satisfy Ready(v,p), v must ensure
that it already received all algorithm messages
sent to it by its neighbors in pulse p-1
⇒ If w did not send any message to v in pulse p-1,
then v must wait forever
(link delays in an asynchronous network are
unpredictable...)
Implementation phases
Conceptual solution:
Employ two communication phases
Phase A:
1. Each processor sends its original messages
2. Processor receiving message from neighbor
sends Ack
⇒ Each processor learns (within finite time)
that all messages it sent during pulse p
have arrived
Implementation phases
Safety property: Node v is safe w.r.t. pulse p,
denoted Safe(v,p), if all messages it sent during
pulse p have already arrived.
Fact:
If each neighbor w of v satisfies Safe(w,p), then
v satisfies Ready(v,p+1)
⇒ Node may generate new pulse once it learns all
neighbors are safe w.r.t. current pulse.
Implementation phases
Phase B:
Apply a procedure to let each processor know
when all its neighbors are safe w.r.t. pulse p
Synchronizer constructions:
• based on 2-phase strategy
• all use same Phase A procedure
• but different Phase B procedures
Synchronizer complexity
Initialization costs:
Tinit(σ) and Cinit(σ) = time and message costs of
initialization procedure setting up synchronizer σ
Pulse overheads:
Cpulse(σ) = cost of synchronization messages
sent by all vertices during their pulse p
Tpulse(σ) = ?
Synchronizer complexity
Tpulse(σ) = ?
(Time periods during which
different nodes are at pulse p…)
Synchronizer complexity
Let tmax(p) = max_{v∈V} {t(v,p)}
(time when slowest processor reached pulse p)
Tpulse(σ) = max_{p≥0} {tmax(p+1) - tmax(p)}
Synchronizer complexity
Lemma:
For synchronous algorithm S
and asynchronous A = σ(S):
• Comm(A) = Cinit(σ) + Comm(S) + Time(S)·Cpulse(σ)
• Time(A) = Tinit(σ) + Time(S)·Tpulse(σ)
Basic synchronizer α
Phase B of synchronizer α: direct.
After executing pulse p,
when processor v learns it is safe,
it reports this fact to all neighbors.
Claim: Synchronizer α is correct.
Basic synchronizer α
Claim:
• Cinit(α) = O(|E|)
• Tinit(α) = O(Diam)
• Cpulse(α) = O(|E|)
• Tpulse(α) = O(1)
Note: Synchronizer α is optimal for trees, planar
graphs and bounded-degree networks (mesh,
butterfly, cube-connected cycles, ring, ...)
Basic synchronizer β
Assume: rooted spanning tree T in G
Phase B of β: convergecast process on T
Basic synchronizer β
• When processor v learns all its descendants
in T are safe, it reports this fact to its parent.
• When r0 learns all processors in G are safe,
it broadcasts this along the tree.
Convergecast ends ⇒ all nodes are safe
Basic synchronizer β
Claim: Synchronizer β is correct.
Claim:
• Cinit(β) = O(n|E|)
• Tinit(β) = O(Diam)
• Cpulse(β) = O(n)
• Tpulse(β) = O(Diam)
Note: Synchronizer β is optimal for
bounded-diameter networks.
Understanding the effects of locality
Model:
• synchronous
• simultaneous wakeup
• large messages allowed
Goal:
Focus on limitations stemming from
locality of knowledge
Symmetry breaking algorithms
Vertex coloring problem: associate a color φv
with each v in V, s.t. any two adjacent vertices
have different colors
Naive solution: use unique vertex ID's
⇒ legal coloring by n colors
Goal: obtain coloring with few colors
Symmetry breaking algorithms
Basic palette reduction procedure:
Given legal coloring by m colors, reduce # colors
Δ(G) = max vertex degree in G
Reduction idea:
v's neighbors occupy at most Δ distinct colors
⇒ Δ+1 colors always suffice to find a "free" color
Symmetry breaking algorithms
First Free coloring
(For set of colors P and node set W ⊆ V)
FirstFree(W,P) = min color in P
that is currently not used by any vertex in W
Standard palette:
Pm = {1,...,m}, for m ≥ 1
Sequential color reduction
For every node v do (sequentially):
φv ← FirstFree(Γ(v), PΔ+1)
/* pick new color 1 ≤ φv ≤ Δ+1, different from
those used by the neighboring nodes */
Procedure Reduce(m) - example
Palette: P3 = {1,...,3}
(figure: a small graph legally colored from P3)
Procedure Reduce(m) - parallelization
Code for v:
For round j = Δ+2 to m do:
/* all nodes colored j re-color themselves
simultaneously */
• If v's original color is φv = j then do:
1. Set φv ← FirstFree(Γ(v), PΔ+1)
/* pick new color 1 ≤ φv ≤ Δ+1, different
from those used by the neighbors */
2. Inform all neighbors
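A sequential simulation of the rounds of Reduce can be sketched as follows (nodes of the same color class are mutually nonadjacent in a legal coloring, so processing them one by one within a round matches the simultaneous behavior; names are illustrative):

```python
def reduce_colors(adj, color, m, delta):
    """Sketch of Procedure Reduce(m): in round j (= delta+2 .. m),
    every node colored j re-colors itself with FirstFree over the
    palette {1, ..., delta+1}."""
    palette = range(1, delta + 2)
    for j in range(delta + 2, m + 1):
        # nodes of color class j are mutually nonadjacent, so a
        # sequential pass is equivalent to their simultaneous step
        for v in [u for u in adj if color[u] == j]:
            used = {color[w] for w in adj[v]}
            color[v] = min(c for c in palette if c not in used)
    return color

# path 1-2-3-4-5 with delta = 2, initially colored by unique IDs
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
out = reduce_colors(path, {v: v for v in path}, m=5, delta=2)
assert all(out[v] != out[w] for v in path for w in path[v])
assert max(out.values()) <= 3          # delta+1 colors
```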
Procedure Reduce(m) - analysis
Lemma:
• Procedure Reduce produces a legal coloring
of G with Δ+1 colors
• Time(Reduce(m)) = m-Δ-1
Proof:
Time bound:
each iteration requires one time unit.
Procedure Reduce(m) - analysis (cont)
Correctness: consider iteration j.
• When node v re-colors itself, it always finds a
non-conflicting color
(≤ Δ neighbors, and Δ+1 color palette)
• No conflict with nodes recolored in earlier
iterations (or originally colored 1, 2, …, Δ+1).
• No conflict with choices of other nodes in
iteration j (they are all mutually nonadjacent,
by legality of the original coloring)
⇒ New coloring is legal
3-coloring trees
Goal: color a tree T with 3 colors in time O(log*n)
Recall:
log(1)n = log n
log(i+1)n = log(log(i)n)
log*n = min { i | log(i)n ≤ 2 }
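The iterated-log definition translates directly to code; a small sketch (using log base 2, and taking log(0)n = n):

```python
import math

def log_star(n):
    """log* n = min { i | log^(i) n <= 2 }, with base-2 logs."""
    i = 0
    x = float(n)
    while x > 2:
        x = math.log2(x)
        i += 1
    return i

assert log_star(2) == 0
assert log_star(16) == 2          # 16 -> 4 -> 2
assert log_star(2 ** 16) == 3     # 65536 -> 16 -> 4 -> 2
```

Even for astronomically large n, log* n stays tiny, which is why O(log* n) is considered "almost constant" time.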
General idea:
• Look at colors as bit strings.
• Attempt to reduce # bits used for colors.
3-coloring trees
|φv| = # bits in φv
φv[i] = i-th bit in the bit string representing φv
Specific idea: produce new color from old φv:
1. Find index 0 ≤ i < |φv| in which
v's color differs from its parent's.
(Root picks, say, index 0.)
2. Set new color to: ⟨i, φv[i]⟩
/* the index i concatenated with the bit φv[i] */
3-coloring trees
We will show:
a. Neighbors have different new colors
b. Length of new coloring is roughly logarithmic
in that of the previous coloring
3-coloring trees (cont)
Algorithm SixColor(T) - code for v
Set φv ← ID(v)
/* initial coloring */
Repeat:
• l ← |φv|
• If v is the root then set I ← 0
else set I ← min{ i | φv[i] ≠ φparent(v)[i] }
• Set φv ← ⟨I, φv[I]⟩
• Inform all children of this choice
until |φv| = l
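A centralized simulation of SixColor can be sketched as follows (bit strings are modeled as Python strings; the fixed-width index encoding is one possible convention, chosen here for illustration):

```python
def six_color(parent_of, ids):
    """Centralized sketch of Algorithm SixColor on a rooted tree.
    parent_of[v] = parent of v (None for the root); ids[v] = unique ID."""
    K = max(ids.values()).bit_length()
    color = {v: format(ids[v], f"0{K}b") for v in parent_of}
    while True:
        old_len = len(next(iter(color.values())))
        idx_bits = max(1, (old_len - 1).bit_length())
        new = {}
        for v, p in parent_of.items():
            if p is None:
                i = 0                      # root picks, say, index 0
            else:                          # first index where v differs from parent
                i = next(k for k in range(old_len)
                         if color[v][k] != color[p][k])
            # new color = <index i, bit color[v][i]>
            new[v] = format(i, f"0{idx_bits}b") + color[v][i]
        color = new
        if len(next(iter(color.values()))) == old_len:
            return color                   # length stopped shrinking

tree = {1: None, 2: 1, 3: 1, 4: 2, 5: 2}        # parent pointers
final = six_color(tree, {1: 1, 2: 2, 3: 3, 4: 4, 5: 5})
assert all(p is None or final[v] != final[p] for v, p in tree.items())
assert len(set(final.values())) <= 6
```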
3-coloring trees (cont)
Lemma:
In each iteration, Procedure SixColor produces
a legal coloring
Proof:
Consider an iteration, neighboring nodes v,w ∈ T,
v = parent(w).
I = index picked by v;
J = index picked by w
3-coloring trees (cont)
If I ≠ J: the new colors of v and w differ in the
1st component.
If I = J: the new colors differ in the 2nd
component (w picked J as an index where its
color differs from its parent's, so φw[J] ≠ φv[I])
3-coloring trees (cont)
Ki = # bits in color representation after i-th
iteration.
(K0 = K = O(log n) = # bits in original ID coloring.)
Note: Ki+1 = ⌈log Ki⌉ + 1
⇒ 2nd coloring uses about log(2)n bits,
3rd about log(3)n, etc.
3-coloring trees (cont)
Lemma: Final coloring uses six colors
Proof:
Final iteration i satisfies Ki = Ki-1 ≤ 3
⇒ In final coloring, there are ≤ 3 choices for the
index to the bit in the (i-1)-st coloring, and two
choices for the value of the bit
⇒ Total of six possible colors
Reducing from 6 to 3 colors
Shift-down operation:
Given legal coloring of T:
1. re-color each non-root vertex by color of
parent
2. re-color root by new color (different from
current one)
Reducing from 6 to 3 colors
Claim:
1. Shift-down step preserves coloring legality
2. In new coloring, siblings are monochromatic
Reducing from 6 to 3 colors
Cancelling color x, for x ∈ {4,5,6}:
1. Perform shift-down operation on current
coloring,
2. All nodes colored x apply FirstFree(Γ(v), P3)
/* choose a new color from among {1,2,3}
not used by any neighbor */
Reducing from 6 to 3 colors
(Example figure: cancelling color 4 via shift-down,
then FirstFree)
Claim: Rule for cancelling color x produces
legal coloring
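Shift-down and color cancellation can be sketched together as a centralized simulation (function and variable names are illustrative):

```python
def shift_down(parent_of, color):
    """Shift-down: each non-root takes its parent's color; the root
    takes some color in {1,2,3} different from its current one."""
    new = {}
    for v, p in parent_of.items():
        if p is None:
            new[v] = next(c for c in (1, 2, 3) if c != color[v])
        else:
            new[v] = color[p]
    return new

def cancel_color(parent_of, adj, color, x):
    """Cancel color x in {4,5,6}: shift-down, then every node still
    colored x picks FirstFree from the palette {1,2,3}."""
    color = shift_down(parent_of, color)
    # after shift-down, x-colored nodes are mutually nonadjacent
    # (siblings are monochromatic), so a sequential pass is safe
    for v in [u for u in adj if color[u] == x]:
        used = {color[w] for w in adj[v]}
        color[v] = next(c for c in (1, 2, 3) if c not in used)
    return color

# star rooted at 1, colored with {1,2,4}: cancel color 4
parent_of = {1: None, 2: 1, 3: 1}
adj = {1: [2, 3], 2: [1], 3: [1]}
out = cancel_color(parent_of, adj, {1: 4, 2: 1, 3: 2}, 4)
assert all(out[v] != out[w] for v in adj for w in adj[v])
assert set(out.values()) <= {1, 2, 3}
```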
Overall 3-coloring process
1. Invoke Algorithm SixColor(T)  (O(log*n) time)
2. Cancel colors 6, 5, 4  (O(1) time)
Thm:
There is a deterministic distributed algorithm
for 3-coloring trees in time O(log*n)
Δ+1-coloring for arbitrary graphs
Goal: color G of max degree Δ
with Δ+1 colors in O(Δ log n) time
Node ID's in G = K-bit strings
Idea: recursive procedure ReColor(x),
where x = binary string of ≤ K bits.
Ux = { v | ID(v) has suffix x }
(|Ux| ≤ 2^(K-|x|))
The procedure is applied to Ux, and returns with
a coloring of Ux's vertices with Δ+1 colors.
Δ+1-coloring for arbitrary graphs
Procedure ReColor(x) - intuition
If |x| = K (Ux has ≤ one node) then return color 0.
Otherwise:
1. Separate Ux into two sets U0x and U1x
2. Recursively compute Δ+1 coloring for each,
invoking ReColor(0x) and ReColor(1x)
3. Remove conflicts between the two colorings
by altering the colors of U1x vertices, color by
color, as in Procedure Reduce.
ReColor – distributed implementation
Procedure ReColor(x) – code for v ∈ Ux
/* ID(v) = a1a2...aK , x = aK-|x|+1...aK */
• Set l ← |x|
• If l = K
/* singleton Ux = {v} */
then set φv ← 0 and return
• Set b ← aK-l
/* v ∈ Ubx */
• φv ← ReColor(bx).
Procedure ReColor(x) - code for v (cont)
/* Reconciling the colorings on U0x and U1x */
• If b = 1 then do:
• For round i = 1 through Δ+1 do:
• If φv = i then do:
• φv ← FirstFree(Γ(v), PΔ+1)
(pick a new color 1 ≤ φv ≤ Δ+1, different
from those used by any neighbor)
• Inform all neighbors of this choice
Analysis
Lemma:
For λ = empty word:
• Procedure ReColor(λ) produces legal coloring
of G with Δ+1 colors
• Time(ReColor(λ)) = O(Δ log n)
Analysis
Proof:
Sub-claim: ReColor(x) yields legal Δ+1-coloring
for vertices of subgraph G(Ux) induced by Ux
Proof:
By induction on length of parameter x.
Base (|x|=K): Immediate
General case: Consider run of ReColor(x).
Note: Coloring assigned to U0x is legal (by Ind.
Hyp.), and does not change later.
Analysis (cont)
Consider v in U1x recoloring itself in some
iteration i via the FirstFree operation.
Note: v always finds a non-conflicting color:
• No conflict with nodes of U1x recolored in
earlier iterations, or with nodes of U0x
• No conflict with other nodes that recolor in
iteration i (mutually non-adjacent, by legality of
the coloring generated by ReColor(1x) on U1x)
⇒ New coloring is legal
Analysis (cont)
Time bound: each of the K = O(log n) recursion
levels requires Δ+1 time units
⇒ O(Δ log n) time
Lower bound for 3-coloring the ring
Lower bound: Any deterministic distributed
algorithm for 3-coloring n-node rings requires
at least (log*n-1)/2 time.
Applies in strong model:
After t time units, v knows everything known
to anyone in its t-neighborhood.
In particular, given no inputs but vertex ID's:
after t steps, node v learns the topology of its
t-neighborhood Γt(v) (including ID's)
Lower bound for 3-coloring the ring
On a ring, v learns a (2t+1)-tuple (x1,...,x2t+1)
from the space W2t+1,n, where
Ws,n = {(x1,...,xs) | 1 ≤ xi ≤ n, xi ≠ xj for i ≠ j},
• xt+1 = ID(v),
• xt and xt+2 = ID's of v's two neighbors,
• etc.
Coloring lower bound (cont)
W.l.o.g., any deterministic t(n)-step algorithm At
for coloring a ring in cmax colors follows a
2-phase policy:
• Phase 1: for t rounds, exchange topology info.
At end, each v holds a tuple α(v) ∈ W2t+1,n
• Phase 2: select φv ← φA(α(v)), where
φA : W2t+1,n → {1,...,cmax}
is the coloring function of algorithm A
Coloring lower bound (cont)
Define a graph Bs,n = (Ws,n, Es,n), where
Es,n contains all edges of the form
{(x1,x2,...,xs), (x2,...,xs,xs+1)}
satisfying x1 ≠ xs+1
Coloring lower bound (cont)
Note: two s-tuples of Ws,n,
(x1,x2,...,xs) and (x2,...,xs,xs+1),
are connected in Bs,n
⇔ they may occur as the tuples corresponding
to two neighboring nodes in some ID
assignment for the ring.
Coloring lower bound (cont)
Lemma: If Algorithm At produces a legal
coloring for any n-node ring,
then the function φA defines a legal coloring
for the graph B2t+1,n
Proof:
Suppose φA is not a legal coloring for B2t+1,n,
i.e., there exist two neighboring vertices
a = (x1,x2,...,x2t+1) and b = (x2,...,x2t+1,x2t+2)
in B2t+1,n s.t. φA(a) = φA(b)
Coloring lower bound (cont)
Consider an n-node ring with an ID assignment
containing a and b as the tuples of two
neighboring nodes v and w.
Coloring lower bound (cont)
Then algorithm A colors the neighboring nodes
v and w by colors φA(a) and φA(b) respectively.
These colors are identical, so the ring coloring
is illegal; contradiction.
Coloring lower bound (cont)
Corollary:
If the n-vertex ring can be colored in t rounds
using cmax colors, then χ(B2t+1,n) ≤ cmax
Thm:
Any deterministic distributed algorithm
for coloring the (2n)-vertex ring with two colors
requires at least n-1 rounds.
Coloring lower bound (cont)
Proof:
By the Corollary, if there is a 2-coloring
algorithm working in t time units,
then χ(B2t+1,2n) ≤ 2,
i.e., B2t+1,2n is 2-colorable,
hence B2t+1,2n is bipartite.
But for t ≤ n-2, this leads to a contradiction,
since B2t+1,2n contains an odd-length cycle,
hence is not bipartite.
Coloring lower bound (cont)
The odd cycle:
(1,2,…,2t+1)
(2,…,2t+1,2t+2)
(3,…,2t+3)
(4,…,2t+3,1)
(5,…,2t+3,1,2)
…
(2t+3,1,2,…,2t)
(2t+3 vertices in total: an odd cycle)
Coloring lower bound (cont)
Returning to 3-coloring: we prove the following:
Lemma: χ(B2t+1,n) ≥ log(2t)n
Def: family of directed graphs B̂s,n = (Ŵs,n, Ês,n):
Ŵs,n = {(x1,...,xs) | 1 ≤ x1 < ... < xs ≤ n },
Ês,n = all (directed) arcs
(x1,x2,...,xs) → (x2,...,xs,xs+1)
Coloring lower bound (cont)
Claim: χ(B̂s,n) ≤ χ(Bs,n)
Proof: the undirected version of B̂s,n is a
subgraph of Bs,n
⇒ To prove the lemma, i.e., bound χ(B2t+1,n),
it suffices to show that χ(B̂2t+1,n) ≥ log(2t)n
Coloring lower bound (cont)
Recursive representation for the directed
graphs B̂: based on directed line graphs.
Def: for a directed graph H = (U,F), the directed
line graph of H, DL(H), is a directed graph with
V(DL(H)) = F,
E(DL(H)) contains an arc (e,e') (for e,e' ∈ F)
iff in H, e' starts at the vertex in which e ends
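The DL operation itself is easy to code and check on a small digraph; a sketch (the set-of-arcs representation is chosen here for illustration):

```python
from itertools import permutations

def directed_line_graph(arcs):
    """DL(H): nodes are the arcs of H; there is an arc (e, e')
    iff e' starts at the vertex in which e ends."""
    return {(e, e2) for e in arcs for e2 in arcs if e[1] == e2[0]}

# directed triangle 1 -> 2 -> 3 -> 1: DL is again a directed triangle
tri = {(1, 2), (2, 3), (3, 1)}
assert directed_line_graph(tri) == {
    ((1, 2), (2, 3)), ((2, 3), (3, 1)), ((3, 1), (1, 2))}

# complete directed graph on 4 nodes (the B̂_{1,n} of the lemma, n = 4)
K4 = set(permutations(range(1, 5), 2))
DL1 = directed_line_graph(K4)
assert ((1, 2), (2, 3)) in DL1 and ((1, 2), (3, 4)) not in DL1
```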
Coloring lower bound (cont)
Lemma:
1. B̂1,n = complete directed graph on n nodes
(with every two vertices connected by one
arc in each direction)
2. B̂s+1,n = DL(B̂s,n)
Proof:
Claim 1: immediate from definition.
Coloring lower bound (cont)
Claim 2: establish an appropriate isomorphism
between B̂s+1,n and DL(B̂s,n) as follows.
Consider
e = ⟨(x1,...,xs), (x2,...,xs+1)⟩
= arc of B̂s,n = node of DL(B̂s,n)
Map e to node (x1,...,xs,xs+1) of B̂s+1,n.
Straightforward to verify this mapping
preserves the adjacency relation.
Coloring lower bound (cont)
Lemma: for every directed graph H,
χ(DL(H)) ≥ log χ(H)
Proof: let k = χ(DL(H)).
Consider a k-coloring Φ of DL(H).
Φ = edge coloring for H, s.t. if e' starts at the
vertex in which e ends, then Φ(e') ≠ Φ(e).
Φ can be used to create a 2^k-coloring φ for H,
by setting the color of node v to be the set
φv = { Φ(e) | e ends in v }
Coloring lower bound (cont)
Note:
φ uses ≤ 2^k colors.
φ is legal: for an arc e from u to v, Φ(e) ∈ φv
but Φ(e) ∉ φu (every arc e' entering u has
Φ(e') ≠ Φ(e)), so φu ≠ φv
⇒ χ(H) ≤ 2^k, proving the lemma.
Coloring lower bound (cont)
Corollary: χ(B̂s,n) ≥ log(s-1)n
Proof:
Immediate from the last two lemmas:
(1) B̂1,n = complete directed n-node graph,
B̂s+1,n = DL(B̂s,n)
(2) χ(DL(H)) ≥ log χ(H)
Corollary: χ(B̂2t+1,n) ≥ log(2t)n
Corollary: χ(B2t+1,n) ≥ log(2t)n
Coloring lower bound (cont)
Thm:
Any deterministic distributed algorithm for
coloring n-vertex rings with 3 colors
requires time t ≥ (log*n - 1)/2
Proof:
If A is such an algorithm and it requires t rounds,
then log(2t)n ≤ χ(B2t+1,n) ≤ 3
⇒ log(2t+1)n ≤ 2
⇒ 2t+1 ≥ log*n
Distributed Maximal Independent Set
Goal: Select MIS in graph G
Independent set: U ⊆ V s.t.
u,w ∈ U
⇒ u,w non-adjacent
Maximal IS:
adding any vertex violates independence
Distributed Maximal Independent Set
Note: Maximal IS ≠ Maximum IS
(figure: a maximal IS, a non-maximal and
non-maximum IS, and a maximum IS)
Distributed Maximal Independent Set
Sequential greedy MIS construction
Set U ← V, M ← ∅
While U ≠ ∅ do:
• Pick arbitrary v in U
• Set U ← U - Γ(v)
• Set M ← M ∪ {v}
Distributed Maximal Independent Set
Note:
1. M is independent throughout the process
2. Once U is exhausted, M forms an MIS
Complexity: O(|E|) time
Distributed implementation
Distributedly marking an MIS:
Set local boolean variable b at each v:
v ∈ MIS ⇔ b=1
v ∉ MIS ⇔ b=0
Distributed implementation
Algorithm MIS-DFS
• Single token traversing G in depth-first order,
marking vertices as in / out of MIS.
• On reaching an unmarked vertex:
1. add it to MIS (by setting b to 1),
2. mark its neighbors as excluded from MIS
Complexity:
• Message(MIS-DFS)=O(|E|)
• Time(MIS-DFS)=O(n)
Lexicographically smallest MIS
LexMIS: The lexicographically smallest MIS
over V={1,…,n}
{1,3,5,9} < {1,3,7,9}
Note: Possible to construct LexMIS by simple
sequential (non-distributed) procedure
(go over node list 1,2,…:
- add v to MIS,
- erase its neighbors from list)
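The sequential LexMIS procedure sketched above translates directly to code (an illustrative sketch, assuming nodes are the integers 1..n):

```python
def lex_mis(adj):
    """Sequential LexMIS: scan nodes in increasing ID order;
    add v to the MIS unless it was already erased by a
    smaller-ID neighbor that joined earlier."""
    alive = set(adj)               # nodes not yet erased from the list
    mis = []
    for v in sorted(adj):
        if v in alive:
            mis.append(v)
            alive -= set(adj[v])   # erase v's neighbors
    return mis

# 5-cycle 1-2-3-4-5-1: lexicographically smallest MIS is {1,3}
cyc = {1: [2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 1]}
assert lex_mis(cyc) == [1, 3]
```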
Distributed LexMIS computation
Algorithm MIS-Rank - code for v
• Invoke Procedure Join
• On getting msg Decided(1) from neighbor w
do:
- Set b ← 0
- Send Decided(0) to all neighbors
• On getting msg Decided(0) from neighbor w
do:
- Invoke Procedure Join
Distributed LexMIS computation
Procedure Join – code for v
• If every neighbor w of v with larger ID
has decided b(w)=0
then do:
- Set b ← 1
- Send Decided(1) to all neighbors
Complexity – Distributed LexMIS
Claim:
• Message(MIS-Rank)=O(|E|)
• Time(MIS-Rank)=O(n)
Note: Worst case complexities no better than
naive sequential procedure
Reducing coloring to MIS
Procedure ColorToMIS(m) - code for v
For round i=1 through m do:
- If v's original color is jv = i then do:
• If none of v's neighbors has joined the MIS yet
then do:
Decide b ← 1 (join MIS)
Inform all neighbors
• Else decide b ← 0
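A synchronous simulation of Procedure ColorToMIS (a sketch; assumes a proper coloring color[v] ∈ {1,…,m} is given as a dict):

```python
# Sketch of Procedure ColorToMIS: in round i, every vertex of color i
# joins the MIS iff none of its neighbors joined in an earlier round.
# Vertices of one color class are non-adjacent (proper coloring), so
# decisions made in the same round cannot conflict.
def color_to_mis(adj, color, m):
    b = {}
    for i in range(1, m + 1):
        for v in (u for u in adj if color[u] == i):
            b[v] = 0 if any(b.get(w) == 1 for w in adj[v]) else 1
    return {v for v in adj if b[v] == 1}
```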
Analysis
Lemma: Procedure ColorToMIS constructs MIS
for G in time m
Proof:
Independence:
• Node v that joins MIS in iteration i
is not adjacent to any w that joined MIS earlier.
• It is also not adjacent to any w trying to join in
current iteration
(since they belong to same color class)
Analysis
Maximality:
By contradiction.
For M marked by procedure,
suppose there is a node v ∉ M
s.t. M ∪ {v} is independent.
Suppose jv=i.
Then in iteration i, the decision made by v
was erroneous.
Analysis (cont)
Corollary:
Given algorithm for coloring G
with f(G) colors in time T(G),
it is possible to construct MIS for G
in time T(G)+f(G)
Corollary:
There is a deterministic distributed MIS
algorithm for trees / bounded-degree graphs
with time O(log*n).
Analysis (cont)
Corollary:
There is a deterministic distributed MIS
algorithm for arbitrary graphs with time
complexity O(D(G) log n).
Lower bound for MIS on rings
Fact: Given an MIS for the ring, it is possible to 3-color the ring in one round.
Proof: v ∈ MIS: takes color 1,
sends “2” to left neighbor
w ∉ MIS: takes color 2 if it gets msg “2”;
otherwise takes color 3
Reducing coloring to MIS (cont)
Validity of 3-coloring: Since MIS vertices are
spaced 2 or 3 places apart around the ring
Corollary: Any deterministic distributed MIS
algorithm for the n-vertex ring requires at least
(log*n-3)/2 time.
Randomized distributed MIS algorithm
Doable in time O(log n)
“Store and forward” routing schemes
Routing scheme: Mechanism specifying for
each pair u,v ∈ V a path in G connecting u to v
Routing labels: Labeling assignment
Labels = (v1,...,vn) for G vertices
Headers = { allowable message headers }
“Store and forward” routing schemes
Data structures: Each v stores:
1. Initial header function
Iv: Labels → Headers
2. Header function
Hv: Headers → Headers
3. Port function
Fv: Headers → [1..deg(v,G)]
Forwarding protocol
For u to send a message M to v:
1. Prepare header h=Iu(v), attach it to M
(Typically consists of label of destination, v,
plus some additional routing information)
2. Load M onto exit port i=Fu(h)
Forwarding protocol
Message M with header h' arriving at node w:
• Read h', check whether w = final destination.
• If not:
1. Prepare new header by setting h=Hw(h')
replace old header h' attached to M by h
2. Compute exit port by setting i=Fw(h)
3. load M onto port i
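The forwarding protocol can be sketched as a small simulation (the toy network and all names below are hypothetical; the header here is simply the destination's label):

```python
# Sketch of the store-and-forward protocol on a toy path u - w - v.
# I, H, F map each node to its initial-header, header and port
# functions; links maps (node, port) to the node at the other end.
def route(src, dst, I, H, F, links):
    h = I[src](dst)                  # 1. sender prepares the header
    node, path = src, [src]
    while node != dst:
        port = F[node](h)            # 2. compute the exit port
        node = links[(node, port)]   # 3. message crosses the link
        path.append(node)
        if node != dst:
            h = H[node](h)           # intermediate node rewrites header
    return path

links = {('u', 1): 'w', ('w', 1): 'u', ('w', 2): 'v', ('v', 1): 'w'}
I = {x: (lambda dst: dst) for x in 'uwv'}  # header = destination label
H = {x: (lambda h: h) for x in 'uwv'}      # headers pass unchanged
F = {'u': lambda h: 1,
     'w': lambda h: 2 if h == 'v' else 1,
     'v': lambda h: 1}
```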
Routing schemes (cont)
For every pair u,v, scheme RS specifies a route
r(RS,u,v)=(u=w1,w2,...,wj=v),
through which M travels from u to v.
|r(RS,u,v)| = route length
Partial routing schemes: Schemes specifying a
route only for some vertex pairs in G
Performance measures
w(e) = cost of using link e
~ estimated link delay for message sent on e
Comm(RS,u,v) = cost of uv routing by RS
= weighted route length, |r(RS,u,v)|
Performance measures (cont)
Stretch factor:
Given routing scheme RS for G,
we say RS stretches the path from u to v by
Dilation(RS,u,v) = Comm(RS,u,v) / dist(u,v)
Dilation(RS,G) = max_{u,v∈V} {Dilation(RS,u,v)}
Performance measures (cont)
Memory requirement:
Mem(v,Iv,Hv,Fv) = # memory bits for storing the
label and functions Iv, Hv, Fv in v.
Total memory requirement of RS:
Mem(RS) = ∑_{v∈V} Mem(v,Iv,Hv,Fv)
Maximal memory requirement of RS:
MaxMem(RS) = max_{v∈V} Mem(v,Iv,Hv,Fv)
Routing strategies
Routing strategy: Algorithm computing a routing
scheme RS for every G(V,E,w).
A routing strategy has stretch factor k
if for every G it produces a scheme RS with
Dilation(RS,G) ≤ k.
Memory requirement of routing strategy
(as function of n) =
maximum (over all n-vertex G) memory
requirement of routing schemes produced.
Routing strategies (cont)
Solution 1: Full tables routing (FTR)
Port function Fv stored at v specifies an entire table
(one entry per destination u ≠ v) listing the exit
port used for forwarding M to u.
(figure: port-function table for node 1)
FTR (cont)
Note: The pointers to a particular destination u
form shortest path tree rooted at u
Optimal communication cost:
Dilation(FTR,G)=1
Disadvantage: Expensive for large systems
(each v stores an O(n log n)-bit routing table)
FTR (cont)
Example: Unweighted ring
Consider unit cost n-vertex ring.
FTR strategy implementation:
• Label vertices consecutively as 0,...,n-1
• Route from i to j along shorter of two ring
segments (inferred from labels i,j)
Stretch = 1
(optimal routes)
2 log n bits per vertex (each stores its own label and n)
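The label-based port computation might be sketched as follows (the port numbering is an assumption: 1 = clockwise, 2 = counterclockwise):

```python
# Sketch of FTR on the unit-cost n-vertex ring with labels 0..n-1:
# each node stores only its own label i and n, and infers the exit
# port toward destination j from the labels alone.
def ring_port(i, j, n):
    cw = (j - i) % n               # clockwise distance from i to j
    return 1 if cw <= n - cw else 2

def ring_distance(i, j, n):
    # length of the shorter of the two ring segments
    return min((j - i) % n, (i - j) % n)
```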
Solution 2: Flooding
Origin broadcasts M throughout entire network.
Requires no routing tables
(optimal memory)
Non-optimal communication (unbounded stretch)
FTR vs. Flooding:
Extreme endpoints
of communication-memory tradeoff
Part 2: Representations
1. Clustered representations
• Basic concepts: clusters, covers, partitions
• Sparse covers and partitions
• Decompositions and regional matchings
2. Skeletal representations
• Spanning trees and tree covers
• Sparse and light weight spanners
Basic idea of
locality-sensitive distributed computing
Utilize locality to both simplify control structures
and algorithms and reduce their costs
Operation performed in large network may
concern few processors in small region
(Global operation may have local sub-operations)
Reduce costs by utilizing “locality of reference”
Components of locality theory
• General framework, complexity measures and
algorithmic methodology
• Suitable graph-theoretic structures and
efficient construction methods
• Adaptation to wide variety of applications
Fundamental approach
Clustered representation:
• Impose clustered hierarchical organization on
arbitrary given network
• Use it efficiently for bounding complexity of
distributed algorithms.
Skeletal representation:
• Sparsify given network
• Execute applications on remaining skeleton,
reducing complexity
Clusters, covers and partitions
Cluster = connected subset of vertices S ⊆ V.
Cover of G(V,E,w) = collection of clusters
S={S1,...,Sm} containing all vertices of G
(i.e., s.t. ∪i Si = V).
Partitions
Partial partition of G = collection of pairwise disjoint
clusters S={S1,...,Sm}, i.e., s.t. S ∩ S' = ∅ for S ≠ S'
Partition = cover and partial partition.
Evaluation criteria
Locality and Sparsity
Locality level: cluster radius
Sparsity level: vertex / cluster degrees
Evaluation criteria
Locality - sparsity tradeoff:
locality and sparsity parameters
go opposite ways:
better sparsity ⇔ worse locality
(and vice versa)
Evaluation criteria
Locality measures
Weighted distances:
Length of path (e1,...,es) = ∑_{1≤i≤s} w(ei)
dist(u,w,G) = (weighted) length of shortest u-w path
dist(U,W) = min { dist(u,w) | u ∈ U, w ∈ W }
Evaluation criteria
Diameter, radius: As before, except weighted
For clusters collection S:
• Diam(S)=maxi Diam(Si)
• Rad (S)=maxi Rad (Si)
Sparsity measures
Cover sparsity measure - overlap:
deg(v,S) = # occurrences of v in clusters S ∈ S
i.e., degree of v in the hypergraph (V,S)
DC(S) = maximum degree of cover S
AvD(S) = average degree of S
= ∑_{v∈V} deg(v,S) / n
= ∑_{S∈S} |S| / n
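These measures can be computed directly from a cover given as a list of clusters (a sketch):

```python
# Sketch computing the cover sparsity measures: deg(v,S), the maximum
# degree DC(S), and the average degree AvD(S).
def cover_degrees(clusters, V):
    deg = {v: sum(v in S for S in clusters) for v in V}
    DC = max(deg.values())                        # maximum degree
    AvD = sum(len(S) for S in clusters) / len(V)  # = sum_v deg(v) / n
    return deg, DC, AvD
```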
(figure: a vertex v covered by three clusters, deg(v) = 3)
Partition sparsity measure - adjacency
Intuition: “contract” clusters into super-nodes,
look at resulting cluster graph of S,
G(S)=(S, E),
E = {(S,S') | S,S' ∈ S, S ≠ S',
G contains an edge (u,v) with u ∈ S and v ∈ S'}
(the edges of E are the inter-cluster edges)
Example: A basic construction
Goal: produce a partition S with:
1. clusters of radius < k
2. few inter-cluster edges (or, low AvDc(S))
Algorithm BasicPart
Algorithm operates in iterations,
each constructing one cluster
Example: A basic construction
At end of iteration:
- Add resulting cluster S to output collection S
- Discard it from V
- If V is not empty then start new iteration
Iteration structure
• Arbitrarily pick a vertex v from V
• Grow cluster S around v, adding layer by layer
• Vertices added to S are discarded from V
Iteration structure
• The layer-merging process is repeated
until the required sparsity condition is reached:
the next layer would increase the # of vertices
by a factor of less than n^{1/k}
(i.e., |Γ(S)| ≤ n^{1/k}·|S|)
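A centralized sketch of Algorithm BasicPart (the adjacency-dict input is an assumption; Γ(S) denotes S together with its neighbors):

```python
# Sketch of Algorithm BasicPart: grow a cluster around an arbitrary
# vertex, merging BFS layers as long as the ball still grows by a
# factor of more than n**(1/k); then discard the cluster and repeat.
def basic_part(adj, k):
    n = len(adj)
    remaining = set(adj)
    clusters = []
    while remaining:
        S = {next(iter(remaining))}   # arbitrary starting vertex
        while True:
            # Gamma(S) restricted to the still-unclustered vertices
            ball = (S | {w for u in S for w in adj[u]}) & remaining
            if len(ball) > n ** (1 / k) * len(S):
                S = ball      # merge the next layer
            else:
                break         # sparsity condition reached
        clusters.append(S)
        remaining -= S        # discard the cluster from V
    return clusters
```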
Analysis
Thm: Given an n-vertex graph G(V,E) and an integer k ≥ 1,
Alg. BasicPart creates a partition S satisfying:
1) Rad(S) ≤ k-1,
2) # inter-cluster edges in G(S) ≤ n^{1+1/k}
(or, AvDc(S) ≤ n^{1/k})
Analysis
Proof:
Correctness:
• Every S added to S is a (connected) cluster
• The generated clusters are disjoint
(the algorithm erases from V every v added to a cluster)
• S is a partition (covers all vertices)
Analysis (cont)
Property (2):
By the termination condition of the internal loop,
the resulting S satisfies |Γ(S)| ≤ n^{1/k}·|S|
⇒ (# inter-cluster edges touching S) ≤ n^{1/k}·|S|
This number can only decrease in later iterations,
if adjacent vertices get merged into the same cluster
⇒ # inter-cluster edges ≤ ∑_{S∈S} n^{1/k}·|S| = n^{1+1/k}
Analysis (cont)
Property (1):
Consider iteration of main loop.
Let J = # times internal loop was executed.
Let Si = S constructed on i'th internal iteration
|S_i| ≥ n^{(i-1)/k} for 2 ≤ i ≤ J
(by induction on i)
Analysis (cont)
J ≤ k
(otherwise |S_J| > n, a contradiction)
Note: Rad(S_i) ≤ i-1 for every 1 ≤ i ≤ J
(S_1 consists of a single vertex,
and each additional layer increases the radius by 1)
⇒ Rad(S_J) ≤ k-1
Synchronizers revisited
Goal: Synchronizer capturing reasonable middle
points on time-communication tradeoff scale
Synchronizer γ
Assumption: Given a low-degree partition S
For each cluster in S, build rooted spanning tree.
In addition, between any two neighboring
clusters designate a synchronization link.
Synchronizer γ
Handling safety information (in Phase B)
Step 1: For every cluster separately, apply
synchronizer β
(By end of step, every v knows every w in its
cluster is safe)
Step 2: Every processor incident to
synchronization link sends a message to other
cluster, saying its cluster is safe.
Handling safety information (in Phase B)
Step 3: Repetition of step 1, except the
convergecast performed in each cluster carries
different information:
• Whenever v learns all clusters neighboring
its subtree are safe, it reports this to parent.
Step 4: When root learns all neighboring
clusters are safe, it broadcasts “start new pulse”
on tree
Synchronizer γ
Phases of synchronizer γ
In each cluster
1. Converge(∧, Safe(v,p))
2. Tcast(ClusterSafe(p))
3. Send ClusterSafe(p) messages to adjacent
clusters
4. Converge(∧, AdjClusterSafe(v,p))
5. Tcast(AllSafe(p))
Analysis
Correctness:
Recall:
Readiness property:
Processor v is ready for pulse p once it has
received all the algorithm's messages sent to it
by its neighbors during their pulse p-1.
Readiness rule:
Processor v may generate pulse p once it
finished its original actions for pulse p-1, and
Ready(v,p) holds.
Analysis
To prove that Synchronizer γ properly implements Phase B,
we need to show that it imposes the readiness rule.
Claim: Synchronizer γ is correct.
Complexity
Claim:
1. Cpulse(γ) = O(n^{1+1/k})
2. Tpulse(γ) = O(k)
Proof:
Time to implement one pulse:
≤ 2 broadcast/convergecast rounds in clusters
(+ 1 message-exchange step among border
vertices in neighboring clusters)
⇒ Tpulse(γ) ≤ 4·Rad(T) + 1 = O(k)
Complexity
Messages: Broadcast / convergecast rounds,
separately in each cluster,
cost O(n) msgs total
(clusters are disjoint)
A single communication step among neighboring
clusters requires n·AvDc(T) = O(n^{1+1/k}) msgs
⇒ Cpulse(γ) = O(n^{1+1/k})