State Machines
CS 614
Thursday, Feb 21, 2002
Bill McCloskey
Introduction



State machines provide fault-tolerance
through replication.
They consist of state variables and
commands to change the state.
Clients request the state machine to
execute commands.
[Diagram: a Client issues a Command that takes the state machine from State A to State B.]
An Example: Memory
State variables:
    store: array[0..n] of word
Commands:
    read: command(loc: 0..n)
        send store[loc] to client
    write: command(loc: 0..n, value: word)
        store[loc] := value
Reads and writes values to and from storage.
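The memory example can be sketched in Python as follows. This is only an illustration: the class name `MemoryStateMachine`, the helper `run`, and the zero-initialised store are assumptions, and "send to client" is modelled as a return value.

```python
class MemoryStateMachine:
    """State variable `store` and the read/write commands from the slide."""

    def __init__(self, n):
        # store: array[0..n] of word, so n + 1 locations.
        # Zero-initialised words are an assumption of this sketch.
        self.store = [0] * (n + 1)

    def read(self, loc):
        # "send store[loc] to client" is modelled as a return value.
        return self.store[loc]

    def write(self, loc, value):
        self.store[loc] = value


def run(machine, commands):
    """Execute (name, *args) commands in order; collect the replies to reads."""
    replies = []
    for name, *args in commands:
        reply = getattr(machine, name)(*args)
        if name == "read":
            replies.append(reply)
    return replies
```

Two instances that start in the same state and run the same command list end with identical stores, which is the property replication relies on.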
Ordering of Commands
The machine will have multiple
clients.
• Commands from the same client
must be executed in the order they
were issued.
• Commands from different clients
must be executed in an order
determined by causality.

Fault-tolerance


Replicas of the state machine are run on multiple
processors for fault-tolerance.
Replicas must start in the same initial state and
must process the same set of requests in the
same order. This is a consensus problem.



Agreement: Every non-faulty replica receives
every request.
Order: Every non-faulty replica processes the requests
in the same order.
There are several ways of achieving these
conditions…
Agreement




Often, agreement is met using a Byzantine
Agreement protocol. Every non-faulty processor
will receive each command.
Clients can transmit commands to the replicas,
or a single replica can serve as transmitter for
the client.
More efficient techniques can be used instead
for fail-stop failures.
Other techniques are also possible, such as the
Paxos algorithm (more later).
Order and Stability





A request can be labeled with a unique ID (uid).
The request is considered stable at a certain
state machine replica when requests with lower
uids cannot be received from correct clients.
The replica must wait until a request is stable
before executing it.
So, a state machine processes requests in order
of uids. Therefore, uid ordering is constrained by
causality of requests (from earlier).
Possible stability tests use Lamport clocks or
real-time clocks. Replicas may also agree on a
uid using agreement.
Achieving Stability with
Lamport Clocks





Each request is marked with a logical timestamp,
which serves as its uid.
Causality requirements are satisfied.
Clients must periodically make “null” requests.
A request is stable at a replica when a request
with a larger timestamp has been received from
every client. Then no lower uids can arrive.
Requires FIFO channels (easy). Works in the
presence of fail-stop failures.
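A minimal sketch of this stability test, assuming integer Lamport timestamps, a fixed and known set of clients, and FIFO channels. The names `LamportReplica`, `receive`, and `take_stable` are illustrative only; a `None` command stands for a "null" request.

```python
class LamportReplica:
    def __init__(self, clients):
        # Largest timestamp received so far from each client (0 = none yet).
        self.latest = {c: 0 for c in clients}
        self.pending = []  # (timestamp, client, command), not yet executed

    def receive(self, timestamp, client, command=None):
        # With FIFO channels, timestamps from one client only increase.
        self.latest[client] = max(self.latest[client], timestamp)
        if command is not None:  # null requests only advance the clock
            self.pending.append((timestamp, client, command))

    def is_stable(self, timestamp):
        # Stable once every client has sent something with a larger timestamp.
        return all(t > timestamp for t in self.latest.values())

    def take_stable(self):
        """Remove and return the stable requests, ordered by (timestamp, client)."""
        ready = sorted(r for r in self.pending if self.is_stable(r[0]))
        self.pending = [r for r in self.pending if not self.is_stable(r[0])]
        return ready
```

Without the null requests, a request from a busy client would wait indefinitely on an idle client, which is why the periodic null traffic is needed.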
Achieving Stability with
Real-Time Clocks




Real-time clock value, together with the identity
of the sending process, is the uid.
To satisfy causality, a client can make only one
request per clock tick, and message delivery
must take longer than the difference between
clocks on different processors.
Let Δ be the time for a request to reach every
correct processor.
A request is stable if its timestamp is at least Δ
time units in the past, according to the local
clock. This imposes a delay in processing.
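The real-time stability test reduces to one comparison. In this sketch the delivery bound is passed in as `delta`, clock values are plain integers, and a uid is the pair (clock value, sender id); all names are illustrative assumptions.

```python
def is_stable(request_time, local_clock, delta):
    """True once the request's timestamp is at least `delta` in the past."""
    return local_clock - request_time >= delta


def stable_in_uid_order(pending_uids, local_clock, delta):
    """Return the stable uids, sorted by (clock value, sender id)."""
    return sorted(
        uid for uid in pending_uids if is_stable(uid[0], local_clock, delta)
    )
```

For example, with a local clock of 15 and `delta` of 3, requests stamped 10 are executable but a request stamped 14 must still wait.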
Replica-Generated Uids





Ordering using Lamport clocks requires all
processors to communicate (null requests).
Real-time clock ordering requires clock
synchronization, also expensive.
Could also have the replicas themselves agree
on a uid for each request.
Each replica proposes a candidate uid. The
replicas agree on a uid, accepting the request.
A client cannot issue a new request until its
previous request is accepted, to guarantee
causality.
Implementing Replica-Generated Uids





Final uid is always at least the candidate uid.
A request r’ seen after a request r has been
accepted has a higher candidate uid than the
final uid of r.
A new candidate uid will be one greater than any
candidate or final uid so far, plus a factor of i/N
to make it unique.
Each replica broadcasts its candidate uid.
The final uid is selected as the maximum of all
the uids received.
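The uid rules above can be sketched as follows. Assumptions of this sketch: replica indices run from 0 to N-1, the integer part of the largest uid seen so far is taken before adding 1 (so the fractional part is always exactly i/N), exact fractions are used instead of floats, and the broadcast is simulated by a loop. The names `UidReplica` and `agree_on_uid` are hypothetical.

```python
from fractions import Fraction
from math import floor


class UidReplica:
    def __init__(self, index, total):
        self.index, self.total = index, total
        self.seen = Fraction(0)      # largest candidate uid proposed here
        self.accepted = Fraction(0)  # largest final uid accepted here

    def propose(self):
        # One greater than any candidate or final uid so far, plus i/N.
        base = max(floor(self.seen), floor(self.accepted))
        self.seen = base + 1 + Fraction(self.index, self.total)
        return self.seen

    def accept(self, final_uid):
        self.accepted = max(self.accepted, final_uid)


def agree_on_uid(replicas):
    """Every replica proposes a candidate; the final uid is the maximum."""
    candidates = [r.propose() for r in replicas]
    final = max(candidates)
    for r in replicas:
        r.accept(final)
    return final
```

With three replicas, the first request gets final uid 5/3 and the second 8/3, so a request seen after an acceptance always receives a larger uid.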
Paxos: Another Approach



Lamport’s Paxon Synod is an agreement
algorithm.
It is efficient and practical.
It assumes a partially synchronous model.
    Processors may fail silently.
    Messages are not always delivered on time.
    Messages may be lost or duplicated.
Guarantees:
    Agreement: Everyone agrees on the same value.
    Validity: The chosen value was one of the candidates.
    Termination is not guaranteed.
Stability

An execution fragment α is stable if:
    No processors fail or recover in α.
    No packets are lost or duplicated in α.
    Delivery of messages is on time.
α is nice if it is stable and if a majority of
processes are alive.
We’ll see that Paxos terminates if there is an
execution fragment which is nice for long
enough.
Leader Election





Paxos requires a leader to “run” the algorithm.
Processes exchange “Alive” messages to try to
detect failures.
When the current leader fails, the process with the
largest processor ID is selected as the new leader.
Failure detector doesn’t always work, so there
may be multiple leaders or no leader.
The algorithm may not terminate if there are
always too many leaders.
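A rough sketch of this election rule, assuming each process records when it last heard an "Alive" message from every process (including itself) and treats anything older than a timeout as failed. The function name and parameters are illustrative.

```python
def pick_leader(last_heard, now, timeout):
    """Return the largest ID among processes heard from within `timeout`."""
    alive = [pid for pid, t in last_heard.items() if now - t <= timeout]
    return max(alive) if alive else None
```

Because each process applies the rule to its own view, two processes with different views can pick different leaders, which is how multiple leaders arise.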
Setup






The algorithm operates in a sequence of rounds.
Multiple rounds may be ongoing at the same
time.
In each round, the leader tries to get a majority
for a certain value.
Processes vote in each round.
If a majority of processes vote in a round, then
the value chosen is the one proposed by the
leader.
If too few processes vote in the round, it fails
and a new one is started.
Rounds




Each round is numbered with a tuple (l, r),
where l is the process ID of the leader and r is
the leader’s index for that round.
Lexicographic ordering is used on the rounds.
This way, round numbers are unique.
Thus, each round has a unique value, since a
leader only proposes one value, and a round
only has one leader.
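Python tuples already compare lexicographically, so round numbers can be sketched directly. The names `Round` and `next_round`, and starting the index at 0, are assumptions of this sketch.

```python
from typing import NamedTuple, Optional


class Round(NamedTuple):
    leader: int  # l: process ID of the leader
    index: int   # r: the leader's own index for this round


def next_round(leader_id: int, previous: Optional[Round] = None) -> Round:
    """Return the leader's next round number; unique because leader IDs are."""
    index = 0 if previous is None else previous.index + 1
    return Round(leader_id, index)
```

Two different leaders can never produce the same round number, since the first component differs.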
Algorithm
1. Leader informs other replicas that round R is
   starting.
2. Each replica finds the last round before R in
   which it voted. It sends this vote to the leader.
3. The leader waits for these votes from a
   majority set (quorum) Q.
4. Based on these previous votes, the leader
   decides to propose a certain value v for the
   new round, and informs the replicas in Q.
5. Each replica may vote in this round or not. If
   they choose to vote, they send the vote to the
   leader.
Algorithm (2)
6. If the leader receives a vote from every replica
   in Q, it informs everyone that v is the
   consensus value.
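The six steps can be walked through with an in-memory sketch. It is heavily simplified: there is no network, rounds are plain integers, and each replica is a dict holding `promised` (the highest round it has been asked about) and `vote` (its last vote as a (round, value) pair, or `None`). The function name and this representation are assumptions for illustration.

```python
def run_round(rnd, replicas, quorum, initial_value):
    """Run one round; return the consensus value, or None if the round fails."""
    assert len(quorum) > len(replicas) // 2, "Q must be a majority set"

    # Steps 1-3: announce round `rnd` and collect each member's last vote
    # cast before `rnd`.
    reports = []
    for name in quorum:
        state = replicas[name]
        state["promised"] = max(state["promised"], rnd)
        vote = state["vote"]
        reports.append(vote if vote is not None and vote[0] < rnd else None)

    # Step 4: propose the value of the most recent reported vote, if any.
    prior = [v for v in reports if v is not None]
    value = max(prior, key=lambda v: v[0])[1] if prior else initial_value

    # Step 5: a replica votes unless it has been asked about a later round.
    voters = []
    for name in quorum:
        state = replicas[name]
        if state["promised"] <= rnd:
            state["vote"] = (rnd, value)
            voters.append(name)

    # Step 6: the value is the consensus value only if all of Q voted.
    return value if len(voters) == len(quorum) else None
```

Once a round succeeds with some value, a later round started with a different initial value still returns the earlier value, because any two majorities share a replica.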
Voting




Why would a replica ever decide not to vote?
When it reports its last vote (cast in, say,
round R’) to the leader of round R, it must
not vote in any round strictly between R’ and R,
since that would invalidate the information it sent.
This means that if leaders keep starting new
rounds, everyone will be forbidden to vote and
the algorithm will never terminate.
If a majority of processes are forbidden to vote
in a round, the round is dead. A dead round can
never succeed.
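The forbidden-round rule can be sketched on the replica side as follows. Rounds are plain integers here for brevity, and the class and method names (`Voter`, `on_query`, `may_vote`, `on_vote_request`) are hypothetical.

```python
class Voter:
    def __init__(self):
        self.votes = {}      # round -> value this replica voted for
        self.forbidden = []  # open intervals (lo, hi) of rounds it must skip

    def on_query(self, rnd):
        """Report the last vote cast before `rnd`; forbid the rounds in between."""
        earlier = [r for r in self.votes if r < rnd]
        last = max(earlier) if earlier else None
        self.forbidden.append((last, rnd))  # lo of None means "from the start"
        return (last, self.votes[last]) if last is not None else None

    def may_vote(self, rnd):
        return not any(
            (lo is None or lo < rnd) and rnd < hi for lo, hi in self.forbidden
        )

    def on_vote_request(self, rnd, value):
        """Vote for `value` in `rnd` unless that round is forbidden."""
        if not self.may_vote(rnd):
            return False
        self.votes[rnd] = value
        return True
```

After voting in round 1 and answering a query for round 4, the replica refuses rounds 2 and 3 but can still vote in round 4.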
Example of Paxos
Leader                          Message            Replica
Choose round R
Ask for previous votes          Query(R) →         Find R’ as previous round
                                                   Forbid votes for rounds in (R’, R)
                                ← Report(R’)
Choose value v to keep
round anchored
Send value                      Vote(R, v) →       Check if forbidden
                                                   Send vote if not
                                ← Voted(R, v)
If majority voted,
send outcome                    Outcome(R, v) →
Anchored Rounds





Let vR be the value that the leader proposes for
round R. (We saw that this is well defined.)
If no quorum is found, vR = null.
A round is anchored if all rounds before it are
either dead or have the same value vR. An
anchored round stays anchored (stable).
Paxos will be set up so that every round is
anchored or has vR = null.
This implies that any two successful rounds have
the same value…
Any Two Successful Rounds
Have the Same Value



For all R: vR = null or R is anchored.
For all R and R’ < R: if R’ is not dead, then vR = null or vR = vR’.
For all R and R’ < R: if R’ is successful, then vR = null or vR = vR’.
Any two successful rounds have the same value.
This is the essential property that we needed.
It tells us that once there is agreement by a
majority, all future rounds will agree on the
same value. Now we need to assure that all
rounds will be anchored.
Anchoring the Rounds





When the leader has received the most recent votes (and
the values voted for) from a majority of replicas, it must
propose a value for the current round R that keeps R
anchored.
It looks backward through the rounds before R, skipping over
those in which no vote was reported. These rounds must
be dead, since the majority that reported no vote in them is
now forbidden to vote in them.
When it reaches a round R’ with a reported value, it chooses
the same value for the new round.
Since R’ was anchored, and all rounds between R’ and R are
dead, R is anchored.
If it finds no such R’, it can choose its own initial value
(given as part of agreement).
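The leader's choice can be sketched as a small function. Each report is either `None` (the replica has not voted) or a (round, value) pair for its most recent vote; the name `choose_value` and this representation are assumptions of the sketch.

```python
def choose_value(reports, initial_value):
    """Pick the value of the most recent reported vote, else the initial value."""
    votes = [r for r in reports if r is not None]
    if not votes:
        return initial_value
    _, value = max(votes, key=lambda rv: rv[0])
    return value
```

As a usage example with hypothetical last votes, suppose A last voted in round 1 for 7, B in round 2 for 8, and C in round 3 for 9. A quorum of A and B then yields 8, while any quorum containing C yields 9.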
An Example
Round     Value
Round 1   7
Round 2   8
Round 3   9
[Vote columns: A Voted, B Voted, C Voted; an X marks each vote cast.]
All rounds are dead. So with complete
information, leader could choose any value.
Q = {A,B}: Round 4 will use value 8
Q = {A,C}: Round 4 will use value 9
Q = {B,C}: Round 4 will use value 9
Another Example
Round     Value
Round 1   7
Round 2   8
Round 3   8
[Vote columns: A Voted, B Voted, C Voted; an X marks each vote cast.]
Round 2 succeeds. Rounds 1 and 3 are dead.
Q = {A,B}: Round 4 will use value 8
Q = {A,C}: Round 4 will use value 8
Q = {B,C}: Round 4 will use value 8
Summary of Paxon Synod

This completes the proof that Paxos is correct.



Validity: Leaders always propose values that they
were given or that were proposed before.
Agreement: The leader sends the consensus result to
everyone. Even if more rounds take place, they’ll
always produce the same value.
Now we have an agreement algorithm which
works in a realistic environment, but which may
not terminate when failures occur.
The Paxon Parliament




Paxos agrees on a single value. For state
machines, we need to agree on the commands
to execute.
We can consider a numbered list of commands
which will be executed. The identity of these
commands will be decided by consensus.
A single leader will run an instance of Paxos for
each index.
For a finite number of indices, the leader is
forced to pick commands based on previous
voting. For the rest, it chooses commands as
they come from the client.
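One way to picture the numbered list of commands is a log that a replica fills in as each index's Paxos instance decides. This sketch assumes indices start at 0 and that a replica executes commands strictly in index order, stopping at the first undecided index; `CommandLog` and its methods are hypothetical names.

```python
class CommandLog:
    def __init__(self):
        self.decided = {}    # index -> command chosen by that index's instance
        self.next_index = 0  # first index not yet executed

    def decide(self, index, command):
        # Agreement means an index is never decided with two different commands.
        assert self.decided.get(index, command) == command
        self.decided[index] = command

    def executable(self):
        """Return newly executable commands in index order, stopping at a gap."""
        ready = []
        while self.next_index in self.decided:
            ready.append(self.decided[self.next_index])
            self.next_index += 1
        return ready
```

If index 1 is decided before index 0, nothing runs until index 0 is decided, at which point both commands become executable in order.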
Summary




A client must wait for a command to reach
consensus before requesting another, to satisfy
causality.
Read operations can be satisfied by checking the
local state or by executing a read command,
which guarantees proper ordering.
Lamport proposes many other optimizations.
Ideally, agreeing on a command takes 3n
messages for n replicas.
