Compiler Assisted Checkpointing/Uncoordinated checkpointing


Checkpointing 2.0
Compiler-Assisted Checkpointing
Uncoordinated Checkpointing
Compiler-Assisted Checkpointing

Compiler-Assisted Checkpointing (1994)
Beck, Plank, Kingsley (UTK)

Compiler-Assisted Memory Exclusion for Fast Checkpointing (1995)
Beck, Plank, Kingsley (UTK)
Motivation



We saw that memory exclusion can
dramatically reduce the size (and overhead)
of a checkpoint file
Choosing what to exclude by hand can be time
consuming, and wrong decisions can cause a
program to be incorrect on recovery
Use a compiler to determine what memory
can be excluded, automating the process and
ensuring its correctness
Compiler Directives


The programmer adds the following
directives to the code
CHECKPOINT_HERE

Direct translation to checkpoint_here()
EXCLUDE_HERE

The compiler inserts exclude_bytes() and
include_bytes() calls at that location
Directive Placement


Poor placement of directives might lead
to inefficient checkpointing
However, the program will still
checkpoint and recover properly
Directive Placement

Why not place EXCLUDE_HERE directly
before all CHECKPOINT_HERE
directives?
for(…) {
EXC…
EXC…
for(…) {
CHE…
CHE…
}
}
Overview of Technique


Perform data flow analysis of the
program to determine which variables
are clean or dead at each
EXCLUDE_HERE statement
Insert the appropriate exclude_bytes()
calls at each location
Build a Control Flow Graph

A control flow graph G=<N,E> is a
directed graph, where each node
represents a program statement, and
each edge represents a possible flow of
control from one statement to another
Example Program
          INTEGER I, X, Y, Z
S1:       Z=3
S2:       X=5
S3:       FOR 100, I = 1,1000
S4:         Y=X+Z
S5:         X=X*Y
S6:         EXCLUDE_HERE
S7:         CHECKPOINT_HERE
S8:  100  CONTINUE
S9:       END
Example CFG
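[Figure: control flow graph of the example program, with nodes S1 through S9; the loop body S4 to S8 forms a cycle, and S9 (END) is the exit]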
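As a rough illustration, the example CFG can be written down as successor lists. This is a sketch under stated assumptions, not the authors' representation: the dictionary layout and the helper names `edges` and `predecessors` are hypothetical, statement numbers follow the REF/DEF example (S1 is `Z=3`, S9 is `END`), and the loop test is assumed to sit at the CONTINUE statement (S8).

```python
# Successor lists for the example program.
# Assumption: S8 (CONTINUE) either jumps back to S4 or falls through to S9 (END).
CFG = {
    "S1": ["S2"],
    "S2": ["S3"],
    "S3": ["S4"],
    "S4": ["S5"],
    "S5": ["S6"],
    "S6": ["S7"],        # EXCLUDE_HERE
    "S7": ["S8"],        # CHECKPOINT_HERE
    "S8": ["S4", "S9"],  # loop back or exit
    "S9": [],            # END
}


def edges(cfg):
    """Return the edge set E of G = <N, E> as (source, target) pairs."""
    return [(src, dst) for src, dsts in cfg.items() for dst in dsts]


def predecessors(cfg):
    """Invert the successor lists; handy for backward data flow analyses."""
    preds = {node: [] for node in cfg}
    for src, dst in edges(cfg):
        preds[dst].append(src)
    return preds
```

Storing successors rather than an edge list keeps the later "for every successor S’ of S" steps a simple lookup.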
Find Sub-graphs

Given the CFG, G, of our program, find
all sub-graphs, G’, where G’ is rooted by
an EXCLUDE_HERE and contains all
paths reachable from that
EXCLUDE_HERE that do not pass
through another EXCLUDE_HERE
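One way to sketch the sub-graph search is a breadth-first traversal that refuses to step onto any EXCLUDE_HERE node. The function name `exclude_subgraph`, the successor-list encoding, and the choice to leave the boundary EXCLUDE_HERE nodes out of G’ are assumptions made for illustration; the loop edge from S8 back to S4 is also assumed.

```python
from collections import deque


def exclude_subgraph(cfg, root, exclude_nodes):
    """Nodes reachable from `root` without passing through an EXCLUDE_HERE.

    `cfg` maps each statement to its successors; `exclude_nodes` is the set
    of all EXCLUDE_HERE statements (including `root` itself).
    """
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in cfg[node]:
            if nxt in exclude_nodes:
                continue  # a path may not pass through an EXCLUDE_HERE
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Example program: S6 is the only EXCLUDE_HERE.
CFG = {
    "S1": ["S2"], "S2": ["S3"], "S3": ["S4"], "S4": ["S5"], "S5": ["S6"],
    "S6": ["S7"], "S7": ["S8"], "S8": ["S4", "S9"], "S9": [],
}
g_prime = exclude_subgraph(CFG, "S6", {"S6"})
```

For the example, the traversal covers the whole loop and the exit, but not the statements that run only before the loop.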
Example G’
[Figure: the example CFG (nodes S1 through S9) with the sub-graph G’ rooted at the EXCLUDE_HERE highlighted]
Strategy

For each G’, calculate two sets

DE(G’) – all variables that are dead at every
CHECKPOINT_HERE in G’
RO(G’) – all variables that are read-only
throughout G’
At each EXCLUDE_HERE, insert calls to

exclude_bytes(v, CKPT_DEAD) for all v in DE(G’)
exclude_bytes(v, CKPT_READONLY) for all v in RO(G’)
include_bytes(v) for all v that are in neither DE(G’)
nor RO(G’)
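The insertion step can be sketched as a small code generator. The helper `directive_calls` is hypothetical, the calls are produced as text only, and the read-only flag is spelled `CKPT_READONLY` on the assumption that it matches the name used in the worked solution.

```python
def directive_calls(variables, de, ro):
    """Return, as text, the calls that would replace one EXCLUDE_HERE.

    variables: all program variables; de: DE(G'); ro: RO(G').
    """
    calls = [f"exclude_bytes({v}, CKPT_DEAD)" for v in sorted(de)]
    calls += [f"exclude_bytes({v}, CKPT_READONLY)" for v in sorted(ro)]
    # Everything that is neither dead nor read-only must be checkpointed.
    calls += [f"include_bytes({v})" for v in sorted(variables - de - ro)]
    return calls


# Sets taken from the example program's solution: DE = {Y}, RO = {Z}.
calls = directive_calls({"I", "X", "Y", "Z"}, de={"Y"}, ro={"Z"})
```

Sorting the variable names only makes the output deterministic; the order of the calls does not matter to the technique.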
Determine Memory Accesses

For each statement, S, determine the
membership of three sets



MAY_REF(S) – every location that may be
referenced by some execution of S
MAY_DEF(S) – every location that may be
defined by some execution of S
MUST_DEF(S) – every location that will be
defined by every execution of S
An Aside

Because our example has no arrays, no
pointers, etc., MUST_DEF(S) and
MAY_DEF(S) will be the same set
Example
Statement   {REF}    {DEF}
S1          {}       {Z}
S2          {}       {X}
S3          {}       {I}
S4          {X,Z}    {Y}
S5          {X,Y}    {X}
S6          {}       {}
S7          {}       {}
S8          {I}      {I}
S9          {}       {}
Liveness/Deadness

v is ‘live’ at S if there is a path from S to
some S’ s.t. v ∈ MAY_REF(S’) and for all S’’
on the path, v ∉ MUST_DEF(S’’)

That is, v is live at S if it is read at some (later) S’
without being redefined anywhere between the
two
If v is not live at S, we say it is dead at S
DEAD(S)

The set DEAD(S) is the set of variables that
are dead immediately before the execution of
S

v ∈ DEAD(S) if v is dead everywhere below S or it
is redefined at S, except if ref’d at S
We calculate DEAD(S) with an iterative
algorithm:
1. For all S, set DEAD(S) = V (the set of all variables)
2. For every statement S,
$$\mathrm{DEAD}(S) = \Big(\big(\bigcap_{S'} \mathrm{DEAD}(S')\big) \cup \mathrm{MUST\_DEF}(S)\Big) - \mathrm{MAY\_REF}(S)$$
   where S’ ranges over the successors of S
3. Repeat step 2 until all DEAD(S) converge
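The three steps can be sketched as a fixed-point loop over the example program. This is an illustration under assumptions: the successor lists assume the loop test sits at S8, statement numbers follow the REF/DEF example (S1 is `Z=3`), END is given DEAD = V because it has no successors, and the name `compute_dead` is hypothetical.

```python
V = {"I", "X", "Y", "Z"}
SUCC = {1: [2], 2: [3], 3: [4], 4: [5], 5: [6], 6: [7], 7: [8], 8: [4, 9], 9: []}
MAY_REF = {1: set(), 2: set(), 3: set(), 4: {"X", "Z"}, 5: {"X", "Y"},
           6: set(), 7: set(), 8: {"I"}, 9: set()}
MUST_DEF = {1: {"Z"}, 2: {"X"}, 3: {"I"}, 4: {"Y"}, 5: {"X"},
            6: set(), 7: set(), 8: {"I"}, 9: set()}


def compute_dead(succ, may_ref, must_def, universe):
    dead = {s: set(universe) for s in succ}        # step 1: DEAD(S) = V
    changed = True
    while changed:                                 # step 3: repeat until stable
        changed = False
        for s in succ:                             # step 2
            if not succ[s]:                        # END: nothing is read later
                new = set(universe)
            else:
                below = set(universe)
                for nxt in succ[s]:
                    below &= dead[nxt]             # dead at every successor
                new = (below | must_def[s]) - may_ref[s]
            if new != dead[s]:
                dead[s] = new
                changed = True
    return dead


dead = compute_dead(SUCC, MAY_REF, MUST_DEF, V)
print(sorted(dead[7]))  # variables dead just before CHECKPOINT_HERE (S7)
```

Starting from V and only ever shrinking the sets guarantees termination, since each pass can remove variables but never add them back.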
Data Flow Eqn. For DEAD

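For DEAD(S):
$$F_S(X) = \begin{cases} V & \text{if } S \text{ is END} \\ (X \cup \mathrm{MUST\_DEF}(S)) - \mathrm{MAY\_REF}(S) & \text{otherwise} \end{cases}$$
$$\text{where } X = \bigcap_{S'} \mathrm{DEAD}(S')$$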
The Set DE(S)


The set DE(S) is the set of all variables
that are dead at every
CHECKPOINT_HERE below S, in the
same sub-graph
Calculate iteratively, as before
$$F_S(X) = \begin{cases} X \cap \mathrm{DEAD}(S) & \text{if } S \text{ is CHECKPOINT\_HERE} \\ V & \text{if } S \text{ is EXCLUDE\_HERE or END} \\ X & \text{otherwise} \end{cases}$$
$$\text{where } X = \bigcap_{S'} \mathrm{DE}(S')$$
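A sketch of the DE(S) iteration on the example program follows. Several things here are assumptions: the DEAD sets were worked out by hand from the example's REF/DEF sets, the loop is assumed to branch from S8 back to S4, and the names `compute_de` and `KIND` are hypothetical.

```python
V = {"I", "X", "Y", "Z"}
SUCC = {1: [2], 2: [3], 3: [4], 4: [5], 5: [6], 6: [7], 7: [8], 8: [4, 9], 9: []}
KIND = {6: "EXCLUDE_HERE", 7: "CHECKPOINT_HERE", 9: "END"}
# DEAD(S) for the example, derived by hand (an input to this step).
DEAD = {1: set(V), 2: {"I", "X", "Y"}, 3: {"I", "Y"}, 4: {"Y"}, 5: set(),
        6: {"Y"}, 7: {"Y"}, 8: {"Y"}, 9: set(V)}


def compute_de(succ, kind, dead, universe):
    de = {s: set(universe) for s in succ}          # start from V everywhere
    changed = True
    while changed:
        changed = False
        for s in succ:
            if kind.get(s) in ("EXCLUDE_HERE", "END"):
                new = set(universe)                # sub-graph boundary
            else:
                x = set(universe)
                for nxt in succ[s]:
                    x &= de[nxt]                   # X = intersection over successors
                if kind.get(s) == "CHECKPOINT_HERE":
                    new = x & dead[s]              # must also be dead at this checkpoint
                else:
                    new = x
            if new != de[s]:
                de[s] = new
                changed = True
    return de


de = compute_de(SUCC, KIND, DEAD, V)
print(sorted(de[7]))  # DE at the statement following EXCLUDE_HERE
```

Resetting to V at each EXCLUDE_HERE is what confines the analysis to a single sub-graph.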
The Set RO(S)


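v is read-only at S if v ∉ MAY_DEF(S)
The set RO(S) is the set of variables
that are read-only along all paths from
S in the same sub-graph
$$F_S(X) = \begin{cases} V & \text{if } S \text{ is EXCLUDE\_HERE or END} \\ X - \mathrm{MAY\_DEF}(S) & \text{otherwise} \end{cases}$$
$$\text{where } X = \bigcap_{S'} \mathrm{RO}(S')$$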
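The RO(S) equation can be sketched the same way. As assumptions: X is taken to be the intersection of RO(S’) over the successors S’ of S, the example's loop is assumed to branch from S8 back to S4, MAY_DEF equals the DEF column of the example, and `compute_ro` is a hypothetical name.

```python
V = {"I", "X", "Y", "Z"}
SUCC = {1: [2], 2: [3], 3: [4], 4: [5], 5: [6], 6: [7], 7: [8], 8: [4, 9], 9: []}
KIND = {6: "EXCLUDE_HERE", 7: "CHECKPOINT_HERE", 9: "END"}
MAY_DEF = {1: {"Z"}, 2: {"X"}, 3: {"I"}, 4: {"Y"}, 5: {"X"},
           6: set(), 7: set(), 8: {"I"}, 9: set()}


def compute_ro(succ, kind, may_def, universe):
    ro = {s: set(universe) for s in succ}          # start from V everywhere
    changed = True
    while changed:
        changed = False
        for s in succ:
            if kind.get(s) in ("EXCLUDE_HERE", "END"):
                new = set(universe)                # sub-graph boundary
            else:
                x = set(universe)
                for nxt in succ[s]:
                    x &= ro[nxt]                   # read-only along all paths
                new = x - may_def[s]               # S itself may write these
            if new != ro[s]:
                ro[s] = new
                changed = True
    return ro


ro = compute_ro(SUCC, KIND, MAY_DEF, V)
print(sorted(ro[7]))  # RO at the statement following EXCLUDE_HERE
```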
Solution to Example



DE(G’) and RO(G’) are defined to be DE(S)
and RO(S), where S is the statement directly
following the EXCLUDE_HERE
For our example, DE(S7) = {Y} and RO(S7) = {Z}
S6 would become
    exclude_bytes(Y, CKPT_DEAD)
    exclude_bytes(Z, CKPT_READONLY)
    include_bytes(everything else)
Uncoordinated Distributed
Checkpoints


Q: How can we extend our uniprocessor
checkpointing to a distributed system?
A: Each process in the distributed system
takes an independent checkpoint
Global State


The global state is a collection of the
states of each of the individual
processes (and of the communication
channels)
A consistent global state is one
that may occur during a failure-free,
correct run of the computation
Consistent States

are states that may have occurred
[Figure: two space-time diagrams of processes p and q, each showing a consistent state]
Inconsistent States

are states that could not have occurred
[Figure: space-time diagram of processes p and q showing an inconsistent state]

Here process p has received a message
that has not been sent
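The rule illustrated here (no process may have received a message that no process has sent) can be sketched as a simple check. The `ProcessState` record and the message identifiers are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessState:
    sent: frozenset      # ids of messages this state records as sent
    received: frozenset  # ids of messages this state records as received


def is_consistent(states):
    """True if every received message is recorded as sent by some process."""
    all_sent = set().union(*(s.sent for s in states.values()))
    all_received = set().union(*(s.received for s in states.values()))
    return all_received <= all_sent


# q sent m1 and p has not received it yet: the message is still in the channel.
in_transit = {
    "p": ProcessState(sent=frozenset(), received=frozenset()),
    "q": ProcessState(sent=frozenset({"m1"}), received=frozenset()),
}
# p received m1, but q's state predates the send.
orphan = {
    "p": ProcessState(sent=frozenset(), received=frozenset({"m1"})),
    "q": ProcessState(sent=frozenset(), received=frozenset()),
}
```

Note the asymmetry: a message that was sent but not yet received is fine, because it can be attributed to the channel state.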
Inconsistent States


Inconsistent states can only occur
where there have been failures, and the
processes have been restarted from
their checkpoints
A rollback-recovery system must ensure
that the system is restarted in a
consistent state, but not necessarily a
state that has ever actually occurred
Consistent Global Checkpoint


A consistent global checkpoint is a set of
checkpoints, one from each process,
that corresponds to a consistent global
state
If processes take their checkpoints
independently, they must search for a
consistent global state upon restart
The Domino Effect

In the event of a failure, ideally we
would like to only roll back the failed
process; however, doing so might leave
the system in an inconsistent state,
necessitating that others be rolled back
as well
Example
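[Figure: space-time diagram of processes p, q, and r exchanging messages 1 through 8, with checkpoints A (on p), B (on q), and C (on r); * marks the failure of r]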
If r fails and restarts at C, message 8 must be invalidated,
forcing q to roll back to B. Message 7 is now invalidated, forcing p
to roll back to A, and so on, all the way back to the beginning
Calculating the Recovery Line


The recovery line for an uncoordinated
system is the most recent consistent set of
checkpoints, one for each process in the
system
In order to calculate the recovery line (RL) after
a failure, the processes record the dependencies
among their checkpoints during failure-free
operation
Protocol





Let c_{i,x} be the xth checkpoint of process P_i
Let I_{i,x} denote the interval between checkpoints c_{i,x-1}
and c_{i,x}
If P_i sends a message, m, to P_j during interval I_{i,x}, P_i
will piggyback (i,x) on m
If P_j receives m during I_{j,y}, it will record the
dependence of I_{j,y} on I_{i,x}, and later save it in
checkpoint c_{j,y}
If P_i fails, then on recovery all the other processes will
send their dependency information to P_i, which will use
that information to calculate the recovery line
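The failure-free part of the protocol can be sketched with a small class. Everything named here (`Process`, the `tag` field, the dictionary message format) is hypothetical; the sketch also assumes intervals are numbered from 1, so that interval x ends with checkpoint x.

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 1      # index x of the current interval
        self.deps = set()      # (i, x) tags received during the current interval
        self.checkpoints = {}  # y -> dependencies saved with checkpoint y

    def send(self, payload):
        # Piggyback (i, x) on the outgoing message.
        return {"payload": payload, "tag": (self.pid, self.interval)}

    def receive(self, message):
        # Record that the current interval depends on the sender's interval.
        self.deps.add(message["tag"])
        return message["payload"]

    def checkpoint(self):
        # Save the recorded dependencies with the checkpoint that closes the interval.
        self.checkpoints[self.interval] = frozenset(self.deps)
        self.deps = set()
        self.interval += 1


p, q = Process("p"), Process("q")
m = p.send("hello")   # sent during p's interval 1
q.receive(m)          # received during q's interval 1
q.checkpoint()        # q's checkpoint 1 now stores the dependency on ("p", 1)
```

After a failure, the surviving `checkpoints` dictionaries are the dependency information that would be sent to the recovering process.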
Checkpoint Dependency Graph



P_i takes the dependency information and
constructs a dependency graph
The nodes of the graph are all of the checkpoints c_{a,b}, and
the current state of all un-failed processes
A directed edge is drawn from c_{i,x-1} to c_{j,y} if

i ≠ j and a message was sent from I_{i,x} to I_{j,y}, or
i = j and y = x
An edge from c_{i,x-1} to c_{j,y} with i ≠ j implies that c_{j,y}
records the receipt of a message that is not marked as
sent in c_{i,x-1}
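The two edge rules can be sketched as follows. The function `build_graph` and its input format are hypothetical. The sketch assumes each process has an initial checkpoint numbered 0, and that the current state of an un-failed process is represented as one extra node after its last checkpoint.

```python
def build_graph(last_node, messages):
    """Build the checkpoint dependency graph as adjacency sets.

    last_node[i]: index of process i's last node (its last checkpoint if it
        failed, or one past it for the current state of an un-failed process).
    messages: ((i, x), (j, y)) pairs, one per message sent in I_{i,x} and
        received in I_{j,y}.
    """
    graph = {(i, x): set() for i, last in last_node.items()
             for x in range(last + 1)}
    for i, last in last_node.items():
        for x in range(1, last + 1):
            graph[(i, x - 1)].add((i, x))          # rule: i = j and y = x
    for (i, x), (j, y) in messages:
        # Receipts in an interval whose closing node was lost leave no edge.
        if i != j and (i, x - 1) in graph and (j, y) in graph:
            graph[(i, x - 1)].add((j, y))          # rule: i != j, message sent
    return graph


# p failed after checkpoint 1; q is alive, with checkpoints 0 and 1 plus its
# current state (node 2).
graph = build_graph(
    {"p": 1, "q": 2},
    [(("p", 1), ("q", 1)), (("p", 2), ("q", 2))],
)
```

In this example the second message was sent after p's last checkpoint, so the edge from (p, 1) to (q, 2) flags q's current state as depending on work that p has lost.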
Example
[Figure: checkpoint dependency graph for processes p, q, and r; * marks the failure]
Algorithm
Include the last ckpt. of each failed P in RecoverySet (RS)
Include the current state of each un-failed P in RS
Mark all ckpts. reachable (by following at least one edge) from any node in RS
While (at least one node in RS is marked)
    Replace each marked RS element with the latest unmarked ckpt.
    of the same process
    Mark all ckpts. reachable from any node in RS
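The marking loop can be sketched over a graph stored as adjacency sets keyed by (process, index). The names `reachable` and `recovery_line` are hypothetical, and the sketch assumes that every process has an initial checkpoint 0 with no incoming edges, so the search for an unmarked checkpoint always stops.

```python
def reachable(graph, roots):
    """Nodes reachable from any root by following at least one edge."""
    seen = set()
    stack = [nxt for root in roots for nxt in graph.get(root, ())]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def recovery_line(graph, last_node):
    """last_node[p]: last checkpoint of a failed p, or the current-state node
    of an un-failed p. Returns the chosen node index for each process."""
    rs = dict(last_node)
    marked = reachable(graph, rs.items())
    while any(node in marked for node in rs.items()):
        for p, x in list(rs.items()):
            while (p, x) in marked:
                x -= 1                 # step back to the latest unmarked ckpt.
            rs[p] = x
        marked = reachable(graph, rs.items())
    return rs


# p failed after checkpoint 1 and had sent a message that q received in its
# current interval (node 2), so q's current state depends on lost work.
graph = {
    ("p", 0): {("p", 1)}, ("p", 1): {("q", 2)},
    ("q", 0): {("q", 1)}, ("q", 1): {("q", 2)}, ("q", 2): set(),
}
line = recovery_line(graph, {"p": 1, "q": 2})
```

Recomputing the marks from scratch on each pass gives the same result as accumulating them, because moving a root to an earlier checkpoint of the same process can only enlarge the reachable set.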
Finding the Recovery Set
[Figure: dependency graph with an X on each marked checkpoint]
Finding the Recovery Set
[Figure: dependency graph after further marking, with an X on each marked checkpoint]
The Recovery Line
[Figure: the resulting recovery line across processes p, q, and r; * marks the failure]