Checkpointing and Recovery

Purpose
• Consider a long-running application
– Regularly checkpoint the application
• Expensive task
– In case of failure, restore to the previous checkpoint
• What happens in the case of a distributed application?
– One (or more) processes fail
– Restoration to the previous checkpoint must be done
consistently across processes
Examples
What to Save?
• Depends on application
– Could be as simple as just program counter
information
– Could be the state of the entire process,
including messages received, etc
Stable Storage
• Checkpoints must survive failure of
processes (including failure during a disk
write)
– A simple approach for stable storage
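One simple way to approximate stable storage on a single disk is sketched below. It is a write-then-rename scheme, not necessarily the approach the slide had in mind: the new checkpoint is written to a temporary file, forced to disk, and then renamed over the old one, so a crash during the write leaves the previous checkpoint intact. The function names `save_checkpoint` and `load_checkpoint` and the JSON format are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` so that a crash mid-write keeps the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory (same filesystem).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before renaming
        # Replace the old checkpoint in one step (atomic on POSIX filesystems).
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default=None):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A process would call `save_checkpoint` at each checkpoint and `load_checkpoint` on restart. This sketch does not protect against loss of the disk itself; that would need a second copy on an independent device.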
Approaches
• Asynchronous
– The local checkpoints at different processes are
taken independently
• Synchronous
– The local checkpoints at different processes are
coordinated
– They may not be at the same time
Asynchronous Checkpointing
• Problem
– Domino effect
[Figure: domino effect, with the failed process marked]
Other Issues with Asynchronous
Checkpointing
• Useless checkpoints
• Need for garbage collection
• Recovery requires significant coordination
Asynchronous Checkpointing
(Continued)
• Identify dependency between different
checkpoint intervals
• This information is stored along with
checkpoints in a stable storage
• When a failed process recovers, it requests this
information from the others to determine the
need for rollback
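The dependency tracking described above might be sketched as follows. Each process piggybacks its current checkpoint-interval number on outgoing messages, and the receiver records which sender interval its own interval now depends on. The class `TrackedProcess`, its fields, and the in-memory `stable` list (a stand-in for stable storage) are all hypothetical names chosen for this illustration.

```python
class TrackedProcess:
    """Tracks dependencies between checkpoint intervals (illustrative only)."""

    def __init__(self, pid):
        self.pid = pid
        self.interval = 1        # interval 1 starts after the initial checkpoint 0
        self.deps = []           # dependencies created in the current interval
        self.stable = [(0, [])]  # stand-in for stable storage: (index, deps)

    def send(self, payload):
        # Piggyback the sender's identity and current interval on the message.
        return {"payload": payload, "src": self.pid, "src_interval": self.interval}

    def receive(self, msg):
        # Record: (sender, sender interval, receiver, receiver interval).
        self.deps.append((msg["src"], msg["src_interval"], self.pid, self.interval))

    def checkpoint(self):
        # The checkpoint that closes interval x is stored with that interval's deps.
        self.stable.append((self.interval, self.deps))
        self.interval += 1
        self.deps = []
```

For example, if process 1 sends a message in its first interval and process 0 receives it in its first interval, the next checkpoint of process 0 is stored together with the tuple `(1, 1, 0, 1)`.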
Two Examples of Asynchronous
Checkpointing
• Bhargava and Lian
• Wang et al
Algorithm by Bhargava and Lian
• Draw an edge from $c_{i,x}$ to $c_{j,y}$ if either
– $i = j$ and $y = x+1$
– $i \neq j$ and a message m is sent from $I_{i,x}$ and
received in $I_{j,y}$
• Where $I_{i,x}$ is the interval between $c_{i,x-1}$ and $c_{i,x}$
• Rollback recovery line used for recovery as
well as garbage collection
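A minimal sketch of how a recovery line could be derived from this graph is given below, under stated assumptions. Checkpoint x of process i is the pair `(i, x)`; `last[i]` is the index of the current, not yet saved, state of process i, treated as a virtual checkpoint; each message is a tuple (sender, send interval, receiver, receive interval). The lost state of every failed process is marked, marks are propagated along the edges, and each process restarts from its latest unmarked checkpoint. The function name and data layout are illustrative, not taken from the original algorithm description.

```python
from collections import defaultdict, deque


def recovery_line(last, messages, failed):
    """Return {process: index of the latest checkpoint it can restart from}."""
    edges = defaultdict(set)
    # Rule 1: consecutive checkpoints of the same process.
    for i, top in enumerate(last):
        for x in range(top):
            edges[(i, x)].add((i, x + 1))
    # Rule 2: a message sent in interval x of i and received in interval y of j.
    for i, x, j, y in messages:
        if i != j:
            edges[(i, x)].add((j, y))

    # The volatile state of each failed process is lost.
    start = [(p, last[p]) for p in failed]
    marked = set(start)
    queue = deque(start)
    while queue:  # breadth-first propagation of rollbacks
        node = queue.popleft()
        for nxt in edges[node]:
            if nxt not in marked:
                marked.add(nxt)
                queue.append(nxt)

    # Checkpoint 0 (the initial state) has no incoming message edges,
    # so it stays unmarked as long as last[i] >= 1.
    return {
        i: max(x for x in range(top + 1) if (i, x) not in marked)
        for i, top in enumerate(last)
    }


# Domino effect: one failure of process 0 rolls both processes back to the start.
print(recovery_line([2, 2], [(0, 2, 1, 1), (1, 1, 0, 1)], {0}))  # {0: 0, 1: 0}
```

A result equal to `last[i]` means process i does not need to roll back. Checkpoints that lie behind the recovery line for every possible failure are candidates for garbage collection.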
Algorithm by Wang et al
• Difference
– If a message sent from $I_{i,x}$ is received in $I_{j,y}$, then draw
an edge from $c_{i,x-1}$ to $c_{j,y}$
• Recovery line obtained is similar to that by
Bhargava and Lian
• Advantage
– Number of useful checkpoints is at most N(N+1)/2
• This can be shown by bounding the number of checkpoints that are
ahead of the recovery line
Coordinated Checkpointing
• Using diffusing computation
– How can we use diffusing computation to
obtain a consistent snapshot?
Algorithm by Tamir and Sequin
• Blocking checkpoint
– A coordinator decides when a checkpoint is taken
– Coordinator sends a request message to all
– Each process
• Stops executing
• Flushes the channels
• Takes a tentative checkpoint
• Replies to coordinator
– When all processes send replies, the coordinator asks
them to change it to a permanent checkpoint
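The two phases above can be sketched as a small single-threaded simulation. This is a simplification under stated assumptions: channel flushing is modeled by draining an `inbox` list into the process state, messages are replaced by direct method calls, and processes are assumed to resume execution once the checkpoint becomes permanent. The names `BlockingProcess` and `coordinated_checkpoint` are hypothetical.

```python
class BlockingProcess:
    def __init__(self, pid, state):
        self.pid = pid
        self.state = list(state)
        self.inbox = []          # messages still in the incoming channels
        self.running = True
        self.tentative = None
        self.permanent = None

    def on_request(self):
        self.running = False                # stop executing
        self.state.extend(self.inbox)       # flush the channels
        self.inbox.clear()
        self.tentative = list(self.state)   # take a tentative checkpoint
        return self.pid                     # reply to the coordinator

    def on_commit(self):
        self.permanent = self.tentative     # tentative becomes permanent
        self.tentative = None
        self.running = True                 # assumption: resume after commit


def coordinated_checkpoint(processes):
    """Coordinator: request, collect all replies, then commit."""
    replies = {p.on_request() for p in processes}
    if replies != {p.pid for p in processes}:
        return False                        # some process did not reply
    for p in processes:
        p.on_commit()
    return True
```

In this sketch a process holds at most one permanent and one tentative checkpoint at any time.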
Algorithm by Tamir and Sequin
• How many checkpoints need to be stored
per process?
Checkpointing in Timed Systems
• What if clocks are perfectly synchronized?
Checkpointing in Timed Systems
• What if clocks are loosely synchronized?
– Max clock drift, δ, is known?
• All processes take a checkpoint at a fixed (local)
time
– After the checkpoint, a process does not send any
messages for 2δ
– The set of local checkpoints is guaranteed to be
consistent
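The consistency argument can be checked numerically with a short sketch. It assumes that each local clock differs from real time by at most the maximum clock drift (called `delta` here), that the silence period is measured without further drift, and that message delay is non-negative. Under those assumptions the first message a process sends after its checkpoint cannot reach another process before that process has taken its own checkpoint, so no checkpoint records a message that the sender's checkpoint does not. The function name and parameters are illustrative.

```python
def no_orphan_messages(offsets, silence, checkpoint_time=100.0):
    """offsets[i] is how far process i's checkpoint is from the agreed time,
    measured in real time. Returns True if no message sent after a
    checkpoint can arrive before another process's checkpoint."""
    real_times = [checkpoint_time + e for e in offsets]
    for i, ci in enumerate(real_times):
        first_send = ci + silence  # earliest real time at which i may send
        for j, cj in enumerate(real_times):
            if i != j and first_send < cj:
                return False       # j could receive it before checkpointing
    return True


delta = 1.0
offsets = [-1.0, 1.0, 0.0]  # every clock within delta of real time
print(no_orphan_messages(offsets, silence=2 * delta))  # True
print(no_orphan_messages(offsets, silence=0.5))        # False
```

The worst case is a process whose clock runs `delta` early sending to one whose clock runs `delta` late, which is why the silence period is twice the drift.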
Minimal Checkpoint
Coordination
• Approach by Koo and Toueg
– Require processes to take a checkpoint only if
they have to
Logging Protocols
• Pessimistic
• Optimistic
• Causal
