Part11:Checkpointing 3 - University of Massachusetts Amherst


UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Fault Tolerant Computing
ECE 655
Checkpointing III
Fall 2006
ECE655/Ckpt Part.12 .1
Copyright 2004 Koren & Krishna
Coordinated Checkpointing Algorithms
 Uncoordinated checkpointing may lead to domino
effect or to livelock
Two basic approaches to checkpoint coordination:
 The Koo-Toueg algorithm, in which one process initiates
the system-wide checkpointing
 An algorithm which staggers checkpoints in time;
Staggering checkpoints can help avoid near-simultaneous
heavy loading of the disk system
 Communication-induced checkpointing procedures
 Simultaneously using coordinated and uncoordinated
checkpointing algorithms - the latter is sufficient
to deal with most isolated failures
Koo-Toueg Algorithm
 Suppose P wants to establish
a checkpoint at P_3
This will record that q1 was
received from Q - to prevent q1
from being orphaned, Q must checkpoint as well
 Thus, establishing a checkpoint at P_3 by P forces
Q to take a checkpoint to record that q1 was sent
 An algorithm for such coordinated checkpointing has
two types of checkpoints - tentative and permanent
 P first records its current state in a tentative
checkpoint, then sends a message to all other
processes from whom it has received a message
since taking its last checkpoint
 Call the set of such processes Π
Koo-Toueg Algorithm - Cont.
 The message tells each process in Π (e.g., Q) the
last message, m_qp, that P received from it
before the tentative checkpoint was taken
 If m_qp was not recorded in a checkpoint by Q: to
prevent m_qp from being orphaned, Q is asked to
take a tentative checkpoint to record sending m_qp
 If all processes in Π that need to do so confirm taking a
checkpoint as requested, then all tentative
checkpoints can be converted to permanent
If some members of Π are unable to checkpoint as
requested, P and all members of Π abandon the
tentative checkpoints, and none are made permanent
 This may set off a chain reaction of checkpoints
Each member of Π can potentially spawn a set of
checkpoints among processes in its own corresponding set
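The tentative/permanent exchange can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the algorithm as published: the `Process` class, the `can_checkpoint` flag and both function names are invented for the example, and every process the initiator has heard from is asked to checkpoint (the test of whether m_qp is already recorded is omitted).

```python
from dataclasses import dataclass, field

@dataclass
class Process:
    name: str
    can_checkpoint: bool = True       # hypothetical flag: False models a refusal
    received_from: set = field(default_factory=set)  # senders heard from since last checkpoint
    tentative: bool = False
    permanent_count: int = 0

def take_tentative(proc, procs, visited):
    """Take tentative checkpoints recursively; return False if any process refuses."""
    if proc.name in visited:
        return True
    visited.add(proc.name)
    if not proc.can_checkpoint:
        return False
    proc.tentative = True
    # Ask every process we received from to checkpoint as well; each of them
    # may in turn involve its own senders (the chain reaction).
    return all(take_tentative(procs[s], procs, visited)
               for s in sorted(proc.received_from))

def coordinated_checkpoint(initiator, procs):
    """Commit all tentative checkpoints, or abandon all of them."""
    visited = set()
    ok = take_tentative(procs[initiator], procs, visited)
    for name in visited:
        p = procs[name]
        if ok and p.tentative:
            p.permanent_count += 1    # tentative becomes permanent
            p.received_from.clear()
        p.tentative = False           # committed or abandoned either way
    return ok
```

With P having received from Q, and Q from R, a call to `coordinated_checkpoint("P", procs)` involves all three processes; if R cannot checkpoint, none of the tentative checkpoints is made permanent.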
Staggered Checkpointing
 The Koo-Toueg algorithm - and others like it -
can lead to a large number of processes taking
checkpoints at nearly the same time
 If they are all writing to a shared stable storage,
e.g., a set of common disks, this surge can lead
to congestion at the disks or network or both
 Either of two approaches can be used to ensure
that, at any time, at most one process is taking
its checkpoint
 (1) Write the checkpoint into a local buffer, then
stagger the writes from buffer to stable storage
 Assuming a buffer of sufficiently large capacity
 (2) Try staggering the checkpoints in time
Staggered Checkpointing - Cont.
 Staggered checkpoints may not be consistent -
there may be orphan messages in the system
 This can be avoided by a coordinating phase in
which each process logs in stable storage all
messages it received since its previous checkpoint
 The message-logging phase of the processes will
overlap in time
 If the volume of messages is less than the size
of the individual checkpoints - the disks and
network will see a reduced surge
Recovery From Failure
 If a process fails, it can be restarted after
rolling it back to its last checkpoint and playing
back all the messages stored in its log
This combination of checkpoint and message log is
called a logical checkpoint
 The staggered checkpointing algorithm guarantees
that all the logical checkpoints form a consistent
recovery line
Phase One of Staggering Algorithm
 Phase 1 - the checkpointing phase:
 for (i=0; i n-1; i++) {
 P_i takes a checkpoint
 P_i sends a message to P_{(i+1) mod n}, ordering the latter
to take a checkpoint
 }
 When P_0 gets a message from P_{n-1} ordering it
to checkpoint - this is the cue for P_0 to initiate
the second (message-logging) phase
 It sends out a marker message on each of its
outgoing channels. When a process P_i receives a
marker message, it goes to phase 2
Phase Two of Staggering Algorithm
 Message Logging Phase
 if (no previous marker message was received in this round
by P_i) then {
 P_i sends a marker message on each of its outgoing
channels
 P_i logs all the messages received by it after the
preceding checkpoint
 }
 else
 P_i updates its message log by adding all the messages
received by it since the last time the log was updated
 end if
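The two phases can be simulated with a short Python sketch. It is only an illustration under assumptions that are not in the slides: the `channels` topology is supplied by the caller, markers are delivered in FIFO order with no delay, and P_0 treats its own initiation as its first marker so that it does not send markers twice.

```python
from collections import deque

def staggered_checkpoint(n, channels):
    """Simulate both phases for processes P_0 .. P_{n-1}.

    channels[i] lists the processes at the far end of P_i's outgoing channels.
    Returns (order in which checkpoints were taken, log updates per process).
    """
    # Phase 1: one checkpoint at a time, passed around the ring.
    checkpoint_order = []
    for i in range(n):
        checkpoint_order.append(i)   # P_i checkpoints, then orders P_{(i+1) mod n}
    # The order from P_{n-1} back to P_0 is the cue for phase 2.

    # Phase 2: marker messages spread from P_0 over all channels.
    sent_markers = [False] * n
    log_updates = [0] * n
    sent_markers[0] = True
    pending = deque((0, dst) for dst in channels[0])
    while pending:
        src, dst = pending.popleft()
        if not sent_markers[dst]:
            # First marker seen by dst: forward markers on its outgoing channels.
            sent_markers[dst] = True
            pending.extend((dst, nxt) for nxt in channels[dst])
        # First marker creates the log, later markers extend it.
        log_updates[dst] += 1
    return checkpoint_order, log_updates
```

For three fully connected processes, `staggered_checkpoint(3, {0: [1, 2], 1: [0, 2], 2: [0, 1]})` checkpoints in the order 0, 1, 2, and each process updates its log once per incoming channel.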
Example of Staggering Algorithm - Phase One
P0 takes a checkpoint and sends take_checkpoint
order to P1
 P1 sends such an order to P2 after taking its own
checkpoint
P2 sends a take_checkpoint order back to P0
 At this point, each of the processes has taken a
checkpoint and the second phase can begin
Example - Phase 2
 P0 sends message_log orders to P1 and P2, which log the
messages they have received since their last checkpoint
 P1 and P2 send out similar message_log orders
Each time such an order is received, the process logs
the messages
 If it is the first time such a message_log order
is received by it - the process sends out marker
messages on each of its outgoing channels
Recovery
 Assumption - given the checkpoint and the messages
received, a process can be recovered
 We may have orphan messages with respect to the
physical checkpoints taken in the first phase
 Orphan messages will not exist with respect to the
latest (in time) logical checkpoints that are
generated using the physical checkpoint and the
message log
Time-Based Synchronization
 Orphan messages cannot happen if each process
checkpoints at exactly the same time
 Practically impossible - clock skews and message
communication times cannot be reduced to zero
 Time-based synchronization can still be used to
facilitate checkpointing - we have to take account
of nonzero clock skews
 Time-based synchronization - processes are
checkpointed at previously agreed times
 Example - ask each process to checkpoint when
its local clock reads a multiple of 100 seconds
 Such a procedure by itself is not enough to avoid
orphan messages
Creation of an Orphan Message - Example
 Each process is checkpointing at time 1100 (local
clock)
 Skew between the two clocks is such that process
P0 checkpoints much earlier (in real time) than
process P1
As a result, P0 sends out a message to P1 after
its checkpoint, which is received by P1 before its
checkpoint
 This message is a potential orphan
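The condition in this example can be written as a one-line check. The sketch below is illustrative only; the function name and the use of real-time instants as arguments are assumptions made for the example.

```python
def is_potential_orphan(send_time, recv_time, sender_ckpt, receiver_ckpt):
    """All arguments are real-time instants.

    A message is a potential orphan when its receipt is captured by the
    receiver's checkpoint but its sending is not captured by the sender's.
    """
    sent_after_sender_checkpoint = send_time > sender_ckpt
    received_before_receiver_checkpoint = recv_time < receiver_ckpt
    return sent_after_sender_checkpoint and received_before_receiver_checkpoint
```

For instance, if clock skew makes P0 checkpoint at real time 1095 and P1 at 1105, a message sent at 1097 and received at 1100 satisfies the condition, while one sent at 1090 does not.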
Preventing Creation of an Orphan Message
 Suppose the skew between any two clocks in the
distributed system is bounded by δ, and each process
is asked to checkpoint when its local clock reads τ
 Following its checkpoint, a process Px should not send
out messages to any process Py until it is certain
that Py's local clock reads more than τ
 Px should remain silent over the duration [τ, τ+δ]
(all times as measured by Px's local clock)
 If the inter-process message delivery time has a
lower bound ε - to prevent orphan messages Px needs
to remain silent during a shorter interval [τ, τ+δ-ε]
If ε>δ, this interval is of zero length - no need for
Px to remain silent
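The sender-side rule reduces to a small calculation, sketched here in Python. The parameter names are assumptions chosen for the example: `tau` is the agreed checkpoint time, `delta` the bound on clock skew, and `epsilon` the lower bound on message delivery time, all in the same time unit.

```python
def silent_interval(tau, delta, epsilon=0.0):
    """Return (start, end) of the sender's silent period on its local clock.

    Returns None when the minimum delivery time already covers the skew,
    so that no silent period is needed.
    """
    end = tau + delta - epsilon
    if end <= tau:
        return None
    return (tau, end)
```

With a checkpoint at local time 100 and a skew bound of 5, the process stays silent over (100, 105); a minimum delivery time of 2 shortens this to (100, 103), and a minimum delivery time of 7 removes the silent period entirely.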
Different Method of Prevention
 Suppose message m is received by process Py when its
clock reads t
m must have been sent (by Px) at least ε earlier, where ε is
the minimum delivery time - before Py's clock read t-ε
 Since the clock skew ≤ δ, at this time, Px's clock
should have read at most t-ε+δ
 If t-ε+δ < τ (the agreed checkpoint time), the sending of m
would be recorded in Px's checkpoint - m cannot be an orphan
 A message m received by Py when its clock reads less
than τ-δ+ε cannot be an orphan
 Orphan messages can be avoided by Py not using and
not including in its checkpoint at τ any message
received during [τ-δ+ε, τ] (Py's clock) until after taking
its checkpoint at τ
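The receiver-side rule can be sketched the same way. As before, the names are assumptions for the example: `t` is the receiver's local clock when the message arrives, `tau` the agreed checkpoint time, `delta` the skew bound and `epsilon` the minimum delivery time.

```python
def must_defer(t, tau, delta, epsilon):
    """True if a message arriving at local time t has to be held back.

    Messages arriving in the window just before the checkpoint may have been
    sent after the sender's checkpoint, so they are neither used nor included
    in the receiver's checkpoint until that checkpoint has been taken.
    """
    window_start = tau - delta + epsilon
    return window_start <= t <= tau
```

With a checkpoint at 100, a skew bound of 5 and a minimum delivery time of 2, messages arriving between 97 and 100 are deferred; anything earlier is safe to use immediately.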
Diskless Checkpointing
 Memory is volatile and unsuitable for storing a
checkpoint
 However, with extra processors, we can permit checkpointing
in main memory
 By avoiding disk writes, checkpointing can be faster
 Best used as one level in a two-level checkpointing scheme
 Have redundant processors using RAID-like techniques
to deal with failure
 Example: a distributed system with five executing,
and one extra, processors
 Each executing processor stores its checkpoint in its memory;
extra processor stores the parity of these checkpoints
 If an executing processor fails, its checkpoint can be
reconstructed from the checkpoints of the remaining four
executing processors plus the parity checkpoint
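The parity idea can be demonstrated with bytewise XOR. The sketch below is illustrative only: it assumes all checkpoints have the same length, and the checkpoint contents and the helper name `xor_blocks` are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Five executing processors, each holding a (toy) checkpoint in memory.
checkpoints = [bytes([i, 2 * i + 1, 7]) for i in range(5)]

# The extra processor stores the parity of the five checkpoints.
parity = xor_blocks(checkpoints)

# Processor 2 fails: rebuild its checkpoint from the other four plus parity.
survivors = checkpoints[:2] + checkpoints[3:]
rebuilt = xor_blocks(survivors + [parity])
```

Because XOR is its own inverse, `rebuilt` equals the lost checkpoint; the same property is what RAID parity relies on.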
RAID-like Diskless Checkpointing
The inter-processor network must have enough
bandwidth for sending checkpoints
 Example: n executing and one checkpointing processor,
if all the executing processors send their checkpoints
to the checkpointing processor to
calculate parity - a potential hotspot
 Solution: Distribute the parity computations
(the accompanying figure shows the case n=5)
Two-Level Recovery
 Coordinating checkpoints prevents orphan messages but
imposes overhead
 Omitting coordination will not affect correctness if failures
are isolated, i.e., at most one process is in a
failed/recovering state at any time
 The vast majority of failures are isolated
 Make recovery from isolated failures fast
 Accept longer recovery times for simultaneous failures
 This suggests a two-level recovery scheme
 First level: each process takes its own checkpoints without
coordination (only useful when recovering from isolated
failures)
 Checkpoint need not be written to disk, can be written into a
memory of another processor
 Second level: occasionally entire system undergoes a
coordinated checkpointing (with higher overhead), which
guards against non-isolated failures
Two-Level Recovery Example
 P0 fails at t0; the system rolls back to the latest
first-level checkpoint; recovery is successful
 P1 fails at t1 and rolls back; at point tx (during
recovery), P2 also fails
These are non-isolated failures - the system rolls back both
processes to the latest second-level checkpoint
 In general, the more common the non-isolated
failures, the greater must be the frequency at
which the second-level checkpoint is taken
Message Logging
 To continue computation beyond latest checkpoint,
recovering process may require all the messages it
received since then, played back in original order
 For coordinated checkpointing - each process can be
rolled back to its latest checkpoint and restarted:
those messages will be resent during reexecution
 To avoid the overhead of coordination and let
processes checkpoint independently, logging messages
is an option
 Two approaches to message logging:
 Pessimistic logging - ensures that rollback will not spread,
i.e., if a process fails, no other process will need to be
rolled back to ensure consistency
 Optimistic logging - a process failure may trigger rollback of
other processes as well
Pessimistic Message Logging
 Simplest approach - the receiver of a message
stops whatever it is doing when it receives a
message, logs the message onto stable storage,
then resumes execution
 Recovering a process from failure - roll it back to
its latest checkpoint and play back to it the
messages it received since that checkpoint, in the
right order
 No orphan messages will exist - every message will
either have been received before the latest
checkpoint or explicitly saved in the message log
 Rolling back one process will not trigger the
rollback of any other process.
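A minimal Python sketch of receiver-based pessimistic logging follows. It is an illustration under assumptions: the class name is invented, a plain list stands in for stable storage, and the process state is a toy integer whose update depends on message order so that replay order matters.

```python
class PessimisticReceiver:
    def __init__(self):
        self.stable_log = []   # stands in for stable storage (survives a crash)
        self.checkpoint = 0    # state saved at the latest checkpoint
        self.state = 0         # volatile state

    def receive(self, msg):
        self.stable_log.append(msg)          # log first, before acting on it
        self.state = self.state * 10 + msg   # order-dependent toy computation

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.stable_log.clear()   # earlier messages are covered by the checkpoint

    def recover(self):
        self.state = self.checkpoint
        for msg in self.stable_log:          # replay in the original order
            self.state = self.state * 10 + msg
```

After a crash, `recover()` restores exactly the pre-failure state from the checkpoint and the log, without involving any other process.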
Sender-Based Message Logging
 Logging messages into stable storage can impose a
significant overhead
 Against one isolated failure at a time, sender-based
message logging can be used
 The sender of a message records it in a log - when required,
the log is read to replay the message
 Each process has send- and receive-counters, which increment
every time the process sends or receives a message
 Each message has a Send Sequence Number (SSN) - value of
the send-counter when it is transmitted
 A received message is allocated a Receive Sequence Number
(RSN) - value of the receive-counter when it was received
 The receiver also sends out an ACK to the sender, including
the RSN it has allocated to the message
 Upon receiving this ACK, the sender acknowledges the ACK in
a message to the receiver
Sender-Based Message Logging - Cont’d
 Between the time that the receiver receives the message
and sends its ACK, and when it receives the sender's ACK
of its own ACK, the receiver is forbidden to send messages
to other processes - essential to maintaining correct
functioning upon recovery
 A message is said to be fully-logged when the sending node
knows both its SSN and its RSN; it is partially-logged
when the sending node does not yet know its RSN
 When a process rolls back and restarts computation from
the latest checkpoint, it sends out to the other processes
a message listing the SSN of their latest message that it
recorded in its checkpoint
 When this message is received by a process, it knows which
messages are to be retransmitted, and does so
 The recovering process now has to use these messages in
the same order as they were used before it failed - easy
to do for fully-logged messages, since their RSNs are
available, and they can be sorted by this number
Partially-logged Messages
 Remaining problem - the partially-logged messages,
whose RSNs are not available
They were sent out, but their ACK was never received
by the sender
 The receiver failed before the message could be
delivered to it, or it failed after receiving the message
but before it could send out the ACK
 The receiver is forbidden to send out messages of its
own to other processes between receiving the message
and sending out its ACK
As a result, receiving the partially-logged messages in
a different order the second time cannot affect any
other process in the system - correctness is preserved
 Clearly, this approach is only guaranteed to work if
there is at most one failed node at any time
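The bookkeeping behind sender-based logging can be sketched as follows. This is a simplified illustration, not a full protocol: the class and function names are assumptions, the acknowledgement of the ACK and the ban on sending while waiting for it are not modeled, and logs are kept in ordinary memory.

```python
class Sender:
    def __init__(self):
        self.send_counter = 0
        self.log = {}                       # SSN -> [payload, RSN or None]

    def send(self, payload):
        self.send_counter += 1
        self.log[self.send_counter] = [payload, None]   # partially logged
        return self.send_counter, payload

    def on_ack(self, ssn, rsn):
        self.log[ssn][1] = rsn              # RSN known: fully logged

    def retransmit(self, last_ssn_in_checkpoint):
        """(payload, RSN) pairs for messages sent after the receiver's checkpoint."""
        return [(payload, rsn) for ssn, (payload, rsn) in self.log.items()
                if ssn > last_ssn_in_checkpoint]

class Receiver:
    def __init__(self):
        self.receive_counter = 0

    def receive(self, ssn, payload):
        self.receive_counter += 1
        return ssn, self.receive_counter    # contents of the ACK: SSN and RSN

def replay_order(retransmitted):
    """Fully logged messages sorted by RSN, then partially logged ones."""
    full = sorted((m for m in retransmitted if m[1] is not None),
                  key=lambda m: m[1])
    partial = [m for m in retransmitted if m[1] is None]
    return [payload for payload, _ in full + partial]
```

On recovery, the receiver collects the retransmissions from every sender and passes them through `replay_order`; the partially logged messages go last because their original position cannot have influenced any other process.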
Optimistic Message Logging
 Optimistic message logging has a lower overhead
than pessimistic logging; however, recovery from
failure is much more complex
Optimistic logging is of theoretical interest
 When messages are received, they are written into
a volatile buffer which, at a suitable time, is copied
into stable storage
Process execution is not disrupted, and so the
logging overhead is very low
 Upon failure, the contents of the buffer can be lost
leading to multiple processes having to be rolled
back
 We need a scheme to handle this situation
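The exposure created by the volatile buffer can be seen in a small Python sketch. It is illustrative only: the class name is invented, and flushing after every `flush_every` messages is a hypothetical policy standing in for "a suitable time".

```python
class OptimisticReceiver:
    def __init__(self, flush_every=3):
        self.flush_every = flush_every   # hypothetical flush policy
        self.volatile_buffer = []        # lost if the process fails
        self.stable_log = []             # stands in for stable storage

    def receive(self, msg):
        self.volatile_buffer.append(msg)   # no stable-storage write on this path
        if len(self.volatile_buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        self.stable_log.extend(self.volatile_buffer)
        self.volatile_buffer.clear()

    def crash(self):
        """Simulate a failure and return the messages that were never logged."""
        lost = list(self.volatile_buffer)
        self.volatile_buffer.clear()
        return lost
```

Any message returned by `crash()` cannot be replayed from the log, which is why processes that depend on the effects of those messages may have to be rolled back as well.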
Checkpointing in Shared-Memory Systems
A variant of CARER for shared-memory bus-based
multiprocessors - each processor has its own cache
 Change the algorithm to maintain cache coherence
among the multiple caches
Instead of the single bit marking a line as
unchangeable, we have a multi-bit identifier:
 A checkpoint identifier, C_id with each cache line
 A (per processor) checkpoint counter, C_count,
keeping track of the current checkpoint number
Shared Memory - Cont.
 To take a checkpoint, increment the counter
 A line modified before will have its C_id less than
the counter
 When a line is updated, set C_id = C_count
 If a line has been modified since being brought into
the cache and C_id < C_count, the line is part of
the checkpoint state, and is therefore unwritable.
Any writes into such a line must wait until the line
is first written into the main memory.
 If the counter has k bits, it rolls over to 0 after
reaching 2^k - 1
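The C_id / C_count rule can be sketched in Python. This is a simplified model under assumptions: the class names and the `writebacks` counter are invented for the example, the counter is an unbounded integer (the k-bit rollover is deliberately not modeled), and coherence with other caches is ignored.

```python
class CacheLine:
    def __init__(self):
        self.modified = False   # modified since being brought into the cache
        self.c_id = 0           # checkpoint identifier set at the last update

class Cache:
    def __init__(self):
        self.c_count = 0        # per-processor checkpoint counter
        self.writebacks = 0     # lines forced out to main memory

    def take_checkpoint(self):
        self.c_count += 1       # taking a checkpoint is just an increment

    def unwritable(self, line):
        # Modified in an earlier checkpoint interval: part of checkpoint state.
        return line.modified and line.c_id < self.c_count

    def write(self, line):
        if self.unwritable(line):
            self.writebacks += 1      # save the checkpointed value to memory first
            line.modified = False
        line.modified = True
        line.c_id = self.c_count
```

A line written before a checkpoint becomes unwritable as soon as the counter is incremented; the next write to it triggers one write-back and then proceeds with the new C_id.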
Bus-Based Coherence Protocol
 Modify a cache coherence algorithm to take account
of checkpointing
 All traffic between caches and memory must use the
bus, i.e., all caches can watch the traffic on bus
 A cache line can be in one of the following states:
invalid, shared unmodified, exclusive modified, and
exclusive unmodified
Exclusive - this is the only
valid copy in any cache;
 Modified - line has been
modified since it was
brought into cache from
memory
Bus-Based Coherence Protocol - Cont’d
 If processor wants to update
a line in shared unmodified
state, it moves into exclusive
modified state
Other caches holding the same line must
invalidate their copies - no longer current
 When a line is in the exclusive modified or exclusive
unmodified state and another cache puts out a read request
on the bus, this cache must service that request (it holds
the only current copy of that line)
 Byproduct - memory is also updated if necessary
 Then, move to shared unmodified
On a write miss, the line is brought into the cache in the
exclusive modified state
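The transitions named on these two slides can be collected into a lookup table. The sketch below covers only those transitions; the event names are assumptions made for the example, and any (state, event) pair not listed is treated as leaving the state unchanged, which a complete protocol would not do.

```python
# (current state, event) -> next state, for the transitions described above.
TRANSITIONS = {
    ("shared unmodified", "local write"): "exclusive modified",
    ("shared unmodified", "remote write"): "invalid",      # copy no longer current
    ("exclusive modified", "remote read"): "shared unmodified",
    ("exclusive unmodified", "remote read"): "shared unmodified",
    ("invalid", "write miss"): "exclusive modified",
}

def next_state(state, event):
    """Return the line's next state; unlisted pairs keep the current state."""
    return TRANSITIONS.get((state, event), state)
```

For example, `next_state("shared unmodified", "local write")` yields the exclusive modified state, while the same event seen as a remote write by another cache holding the line invalidates that cache's copy.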
Bus-Based Coherence and Checkpointing Protocol
 How can we modify this protocol to account for
checkpointing?
 The original exclusive modified state now splits into
two:
 Exclusive modified
 Unwritable
Directory-Based Protocol
 In this approach a directory is maintained centrally
which records the status of each line
 We can regard this directory as being controlled by
some shared-memory controller
 This controller handles all read and write misses and
all other operations which change line state
 Example: If a line is in the exclusive unmodified
state and the cache holding that line wants to modify
it, it notifies the controller of its intention
 The controller can then change the state to exclusive
modified
It is then a simple matter to implement this
checkpointing scheme atop such a protocol
Other Uses of Checkpointing
 (1) Process Migration
 A checkpoint represents process state - migrating a process
from one processor to another means moving the checkpoint,
and computation can resume on the new processor - can be used
to recover from permanent or intermittent faults
 Nature of checkpoint determines whether the new processor
must be of the same model and run the same operating system
 (2) Load-balancing
 Better utilization of a distributed system by ensuring that the
computational load is appropriately shared among the processors
 (3) Debugging
 Core files are dumped when a program exits abnormally - these
are essentially checkpoints, containing full state
information about the affected process - debuggers can read
core files and aid in the debugging process
 (4) Snapshots
 Observing the program state at discrete epochs - deeper
understanding of program behavior