Introduction to Distributed Systems


Paxos and Replicated State Machine (RSM)
Outline
Basic Concepts of Replicated State Machine
Paxos Made Simple
Replicated State Machine
We can replicate data, but how can we guarantee that it is replicated correctly?
How can we replicate computation?
By using a replicated state machine.
Suppose each process executes operations deterministically (or can be
made to do so).
If a group of servers starts from the same initial state and executes the
same sequence of operations, they all reach the same final state.
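The determinism argument above can be illustrated with a minimal sketch. The key-value state machine, its "set" and "add" operations, and the three-replica setup are assumptions made for illustration only; they are not taken from the lab.

```python
from dataclasses import dataclass, field


@dataclass
class KVStateMachine:
    """A deterministic state machine: the same log always yields the same state."""
    state: dict = field(default_factory=dict)

    def apply(self, op):
        # Each operation is a (kind, key, value) tuple (hypothetical format).
        kind, key, value = op
        if kind == "set":
            self.state[key] = value
        elif kind == "add":
            self.state[key] = self.state.get(key, 0) + value


# The agreed sequence of operations (this is what consensus has to provide).
log = [("set", "x", 1), ("add", "x", 4), ("set", "y", 7)]

# Three replicas start from the same (empty) initial state.
replicas = [KVStateMachine() for _ in range(3)]
for replica in replicas:
    for op in log:
        replica.apply(op)

final_states = [replica.state for replica in replicas]
```

If one replica applied the operations in a different order, for example "add" before "set", its state could diverge. That is why the order has to be agreed on.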
Replicated State Machine
So we can start a group of processes and make them execute the
same sequence of operations. Running multiple processes this way
provides fault tolerance.
Many uses:
lock servers (as in the lab), reliable databases, reliable replicated file systems
What is crucial for RSMs?
The members of an RSM must agree on the order of the operations.
Thus, when there are two or more alternative operations, the system
must decide which one is chosen.
Decide means: every member of the RSM agrees to perform one
specific operation, and only that operation.
So the problem reduces to a consensus problem:
◦ How can a group of processes agree on “something”? Here, “something”
means the operations (values!) that the RSM will execute.
The fundamental question is how a group of processes agrees on a single value.
Consensus: decide a single value
FLP tells us that in an asynchronous system no deterministic algorithm can
guarantee distributed consensus if even one process may fail.
◦ Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. Impossibility of
distributed consensus with one faulty process. Journal of the ACM,
32(2):374–382, April 1985.
◦ FLP holds for the general environment (an asynchronous network).
◦ In practice, the situation is not that bad. We have spent a lot of money on
clusters and networks; they should not perform that poorly!
Paxos targets a partially synchronous environment (which is a practical
assumption) and can achieve consensus eventually.
◦ e.g. it works in a cluster environment.
◦ Paxos can be used to decide a value among a group of processes (nodes,
computers).
◦ Leslie Lamport, Paxos Made Simple, 01 Nov 2001
Paxos
Paxos makes a group of processes agree on the same value despite
process failures, network failures, and network delays.
Thus, it can be used as the building block for an RSM.
◦ All processes in the group will decide the same value for the next operation.
To make the algorithm meaningful, we rule out the trivial solution to
consensus: hard-code a predefined value in each process and have them
all simply accept that value.
The overall structure of Paxos
Assume a collection of processes that can propose values. (Where do
values come from? Someone has to make proposals. We call them
Proposers.)
Someone has to accept or reject a proposal. Depending on whether
acceptors accept or reject a value, the value might be chosen. We call
them Acceptors.
If a value has been chosen, then processes should be able to learn the
chosen value. (Eventually someone will know that the system has decided
on a value. We call those who learn the current state of the system
Learners.)
There is no oracle: each process can only act according to the steps
of the candidate algorithm, using its local state as well as the
messages received from others.
Proposers, Acceptors, and Learners (agents) are the three roles in the
consensus algorithm. In an implementation, a single process may act as
more than one agent.
Again, the goal
The goal is to ensure that some proposed value is eventually chosen
and, if a value has been chosen, that processes can eventually learn
the value.
(Chosen means that something (the single value) has been decided, i.e.
locked in the system.)
And eventually means that there is no known time bound on when a value
is chosen.
Recall the two properties of distributed algorithms
Safety (Correctness)
◦ Bad things never happen.
◦ No process in the group decides a value different from the others.
◦ The value should be meaningful. (Not a NOP for every operation! Only a
proposed value can be chosen.)
Liveness
◦ Good things eventually happen.
◦ Eventually, the processes all agree on a single value.
Safety requirements for consensus:
Only a value that has been proposed may be chosen.
Only a single value is chosen.
A process never learns that a value has been chosen unless it actually
has been.
Liveness means that eventually a single value will be chosen; we
leave this issue to the end of this lecture.
Assumptions of Paxos
Agents can communicate with one another by sending messages.
Agents operate at arbitrary speed, may fail by stopping, and may
restart. Since all agents may fail after a value is chosen and then
restart, a solution is impossible unless an agent that has failed and
restarted can remember some information (e.g. by writing it to hard
disk).
Messages can take arbitrarily long to be delivered, can be duplicated,
and can be lost, but they are not corrupted (no Byzantine faults, no
malicious intrusion).
This model fits some practical environments, such as clusters in a
data center.
Consensus is hard
Network failure
Process failure
Network delay
Membership change: processes join and leave the system
A single acceptor?
How:
◦ A proposer sends a proposal to the acceptor, and the acceptor chooses the
first proposed value that it receives.
Does it work?
◦ No, the single acceptor can fail.
So any algorithm that might work should use multiple acceptors.
A proposer sends a proposed value to a set of acceptors. An acceptor
may (or may not) accept the proposed value.
Chosen (Decided): The value is chosen when a large enough set of
acceptors have accepted it.
How large is large enough?
To ensure that only a single value is chosen, we can let a large enough
set consist of any majority of the acceptors.
Because any two majorities have at least one acceptor in common,
this works if an acceptor can accept at most one value.
You can also define “large enough” in some way other than a “majority”.
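The intersection property of majorities can be checked by brute force on a small group. This is a minimal sketch; the function names and the acceptor labels are hypothetical.

```python
from itertools import combinations


def majority(n):
    """Smallest number of acceptors that forms a majority of n."""
    return n // 2 + 1


def quorums_intersect(acceptors, size):
    """True if every two subsets of the given size share at least one acceptor."""
    quorums = [set(c) for c in combinations(acceptors, size)]
    return all(a & b for a, b in combinations(quorums, 2))


acceptors = ["a1", "a2", "a3", "a4"]
with_majority = quorums_intersect(acceptors, majority(len(acceptors)))  # size 3
with_half = quorums_intersect(acceptors, 2)                             # size 2
```

With four acceptors, any two sets of three overlap, while two sets of two can be disjoint ({a1, a2} and {a3, a4}). So "half" is not large enough, but a majority is.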
First requirement we should meet
Suppose we are very lucky: no failures, no message loss, no network
delay. Everything works well. We want a value to be chosen even
if only one value is proposed by a single proposer. (When everything
works, of course we can expect this.)
This suggests the requirement (if an algorithm really works):
P1. An acceptor must accept the first proposal that it receives.
But…
Every acceptor has accepted a value, but no
single value is accepted by a majority of them.
Even with only two proposed values, if each is accepted by about half
of the acceptors, the failure of a single acceptor could make it
impossible to learn which of the values was chosen.
Proposal
P1 and the requirement that a value is chosen only when it is accepted
by a majority of acceptors imply that an acceptor must be allowed to
accept more than one proposal.
An algorithm that might work should use multiple proposals. We can
tell the proposals apart by tagging each one with a natural number.
A proposal: <proposal_number, proposal_value>
Different proposals have different numbers (but may have the same
values).
Different proposals from different proposers
This is one method; you can define others. Which way is used in the lab?
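One common way to make sure that different proposers never use the same proposal number is to interleave the numbers by proposer id. The sketch below shows that scheme under stated assumptions: it is not necessarily the method shown on the slide or used in the lab, and `proposal_number`, `round_no`, and `num_proposers` are hypothetical names.

```python
def proposal_number(round_no, proposer_id, num_proposers):
    """Unique proposal number for a proposer.

    Assumes proposer ids are 1..num_proposers and round_no starts at 0,
    so every number is >= 1 and no two proposers share a number.
    """
    assert 1 <= proposer_id <= num_proposers
    return round_no * num_proposers + proposer_id


# With 3 proposers: proposer 1 uses 1, 4, 7, ...; proposer 2 uses 2, 5, 8, ...
numbers = {
    pid: [proposal_number(r, pid, 3) for r in range(4)]
    for pid in (1, 2, 3)
}
```

Under this scheme a proposer gets its next number by adding `num_proposers` to its current one, so the numbers stay unique and keep growing.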
Proposal Chosen?
A value is chosen when a single proposal with that value has been
accepted by a majority of the acceptors. In that case, we say that the
proposal (as well as its value) has been chosen. Note that a chosen
proposal refers to both its number and its value.
We have not discussed any algorithm so far. Imagine that a given
acceptor can accept multiple proposals. Then we can allow multiple
proposals to be chosen.
However:
P2. If a proposal with value v is chosen, then every higher-numbered
proposal that is chosen has value v. (Once a proposal is chosen, its value
must not be destroyed by the further execution of the algorithm. In a
distributed environment, the algorithm might run forever, until
somebody tells it to stop.)
Since numbers are totally ordered, condition P2 guarantees the crucial
safety property that only a single value is chosen.
Strengthening P2 requirements:
To be chosen, a proposal must be accepted by at least one acceptor. So we
can satisfy P2 by satisfying:
◦ P2a. If a proposal with value v is chosen, then every higher-numbered proposal
accepted by any acceptor has value v.
But a proposer might issue a proposal with another value after value v
has been chosen with proposal number n. Further strengthening:
◦ P2b. If a proposal with value v is chosen, then every higher-numbered proposal
issued by any proposer has value v.
So if we satisfy P2b, we satisfy P2a, and therefore P2.
What does P2b mean?
◦ A proposer must not make its proposal arbitrarily. It should do
something before making its proposal. Learning the history is easy;
predicting the future is difficult.
How to get P2b?
We assume that some proposal with number m and value v is
chosen and show that any proposal issued with number n > m also has
value v. Using induction on n, we assume that every proposal issued
with a number in m … (n-1) has value v.
For the proposal numbered m to be chosen, there must be some set C
consisting of a majority of acceptors such that every acceptor in C
accepted it.
Thus:
◦ Every acceptor in C has accepted a proposal with number in m … (n-1), and
every proposal with number in m … (n-1) accepted by any acceptor has value
v.
How to make the proposals?
Since any set S consisting of a majority of acceptors contains at least
one member of C, we can conclude that a proposal numbered n has
value v by ensuring that the following invariant is maintained:
P2c. For any v and n, if a proposal with value v and number n is issued,
then there is a set S consisting of a majority of acceptors such that
either (a) no acceptor in S has accepted any proposal numbered less
than n, or (b) v is the value of the highest-numbered proposal among
all proposals numbered less than n accepted by the acceptors in S.
So: let the proposer learn something first, and then make its proposal.
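The value rule in P2c can be written as a small function. This is a minimal sketch: it assumes that each acceptor in S has reported either nothing (`None`) or the highest-numbered proposal it has accepted, and the names `Proposal` and `choose_value` are hypothetical.

```python
from typing import Any, NamedTuple


class Proposal(NamedTuple):
    number: int
    value: Any


def choose_value(responses, own_value):
    """Pick the value a proposer is allowed to propose under P2c.

    responses: one entry per acceptor in the majority S, each either None
    (nothing accepted yet) or the highest-numbered Proposal it accepted.
    """
    accepted = [p for p in responses if p is not None]
    if not accepted:
        # Case (a): no acceptor in S has accepted anything, any value is fine.
        return own_value
    # Case (b): adopt the value of the highest-numbered accepted proposal.
    return max(accepted, key=lambda p: p.number).value


free_choice = choose_value([None, None], "mine")
forced_choice = choose_value([None, Proposal(2, "a"), Proposal(5, "b")], "mine")
```

Here `free_choice` is the proposer's own value, while `forced_choice` is "b", the value of the proposal numbered 5.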
Paxos Algorithm
Until now, we have not discussed any algorithm. Let's look at Paxos.
Proposer: prepares proposals
Acceptor: accepts or rejects (does not accept) proposals
Learner: learns the current status of the system. If the value is decided,
it notifies those who are supposed to know the value
Step 1: Prepare
(a) A proposer selects a proposal number n and sends a PREPARE
request with number n to a majority of acceptors.
[Figure: Proposer 1 sends PREPARE j and Proposer 2 sends PREPARE k, with k > j, to the three acceptors.]
Step 2: Promise
(b) If an acceptor receives a prepare request with number n greater
than that of any prepare request to which it has already responded, then
it responds to the request with a promise not to accept any more
proposals numbered less than n and with the highest-numbered proposal
(if any) that it has accepted.
PROMISE n: the acceptor will accept only proposals numbered n or higher.
P1a. An acceptor can accept a proposal numbered n iff it has not
responded to a prepare request having a number greater than n.
[Figure: the acceptors reply with PROMISE j and PROMISE k, with k > j. Proposer 1 is ineligible because a quorum has voted for a higher number than j.]
Step 3: Accept!
(a) If the proposer receives a response to its prepare requests
(numbered n) from a majority of acceptors, then it sends an ACCEPT
request to each of those acceptors for a proposal numbered n with a
value v, where v is the value of the highest-numbered proposal among
the responses, or is any value if the responses reported no proposals.
[Figure: Proposer 1 is disqualified; Proposer 2 offers a value by sending ACCEPT! (v_k, k) to the acceptors.]
Step 4: Accepted
(b) If an acceptor receives an accept request for a proposal numbered
n, it accepts the proposal unless it has already responded to a PREPARE
request having a number greater than n.
[Figure: the acceptors reply Accepted k. A quorum has accepted value v_k; it is now a fact.]
Learning values
[Figure: a learner asks the acceptors “v?”. If a learner interrogates the system, a quorum will respond with fact v_k.]
A learner sends a LEARN request to all (or a majority) of the acceptors.
Acceptors respond with their accepted proposals. If a proposal has been
accepted by a majority of acceptors, this proposal is the decided one.
Proposer Code
struct proposal {number, value} // number >= 1
proposer_make_proposal(n, pvalue)
  send(PREPARE, n) to a majority of accepters;
  wait until [received (ACK-PREPARE, proposal) from a majority of accepters]
  received_proposals = [all received proposals]
  old_max_proposal = the proposal in received_proposals with the maximal
    proposal number (null if no proposal was reported)
  if old_max_proposal == null
    newproposal = (n, pvalue);
  else if old_max_proposal.number > n
    abandon_making_proposal; return; // abandon
  else
    newproposal = (n, old_max_proposal.value);
  send(ACCEPT, newproposal) to a majority of accepters; // or all accepters
Accepter response to PREPARE
old_prepare_number; // highest PREPARE number answered so far, initially 0
accepted_proposals; // set of accepted proposals, initially empty
accepter_on_receive_prepare(PREPARE, number, proposer)
  if number > old_prepare_number
    old_prepare_number = number;
    old_max_proposal = the proposal in accepted_proposals with the maximal
      proposal number (null if none)
    send(ACK-PREPARE, number, old_max_proposal) to proposer
  else
    either also send back the old_max_proposal or just ignore the message
Accepter response to ACCEPT
accepter_on_receive_accept(ACCEPT, proposal, proposer)
  if proposal.number ≥ old_prepare_number
    accepted_proposals = accepted_proposals ∪ {proposal}
  else
    either send back the old_max_proposal or just ignore the message
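The proposer and accepter pseudocode above can be tried out in a single process. The sketch below is an in-memory approximation under stated assumptions, not the lab implementation: messages are plain method calls, nothing is lost or persisted, each acceptor keeps only its highest-numbered accepted proposal, and it also raises its promised number when it accepts (a common implementation choice that the pseudocode does not make). The names `Acceptor`, `Proposal`, and `propose` are hypothetical.

```python
from typing import Any, NamedTuple


class Proposal(NamedTuple):
    number: int
    value: Any


class Acceptor:
    def __init__(self):
        self.promised = 0      # highest PREPARE number answered so far
        self.accepted = None   # highest-numbered proposal accepted so far

    def on_prepare(self, n):
        """Return (ok, accepted). Promise only if n beats every earlier PREPARE."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def on_accept(self, proposal):
        """Accept unless a PREPARE with a greater number was already answered (P1a)."""
        if proposal.number >= self.promised:
            # Assumption: also raise the promise so older proposals are refused.
            self.promised = proposal.number
            self.accepted = proposal
            return True
        return False


def propose(acceptors, n, value):
    """Run PREPARE and ACCEPT once; return the chosen Proposal or None."""
    quorum = len(acceptors) // 2 + 1
    replies = [a.on_prepare(n) for a in acceptors]
    promises = [accepted for ok, accepted in replies if ok]
    if len(promises) < quorum:
        return None  # no majority of promises: abandon this proposal number
    previous = [p for p in promises if p is not None]
    if previous:
        # Must reuse the value of the highest-numbered accepted proposal.
        value = max(previous, key=lambda p: p.number).value
    proposal = Proposal(n, value)
    accepts = sum(a.on_accept(proposal) for a in acceptors)
    return proposal if accepts >= quorum else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "lock(A)")   # chosen with the proposer's own value
second = propose(acceptors, 2, "lock(B)")  # higher number, but must keep "lock(A)"
stale = propose(acceptors, 1, "lock(C)")   # number too low: no promises, abandoned
```

The second proposal carries number 2 but the value "lock(A)", which is exactly what P2b requires once a value has been chosen.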
Learner
repeat
  send(LEARN) to all accepters
  accepted_proposals = all proposals replied
until there exists a proposal that has been accepted by a majority of accepters
that proposal is chosen
Accepter response to LEARN
accepter_on_receive_learn(learner)
  send(ACK-LEARN, accepted_proposals) to learner
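The learner's test, "a proposal accepted by a majority of accepters", can be sketched as a counting function. The reply format (one collection of (number, value) pairs per responding acceptor) and the name `chosen_proposal` are assumptions made for illustration.

```python
from collections import Counter


def chosen_proposal(replies, num_acceptors):
    """Return a proposal accepted by a majority of acceptors, or None.

    replies: one collection of accepted (number, value) proposals per
    acceptor that answered the LEARN request.
    """
    counts = Counter(p for accepted in replies for p in set(accepted))
    for proposal, votes in counts.items():
        if votes > num_acceptors // 2:
            return proposal
    return None


# Two of three acceptors report proposal (1, "x"): it is chosen.
decided = chosen_proposal([[(1, "x")], [(1, "x")], []], num_acceptors=3)

# Each proposal was accepted by only one acceptor: nothing is chosen yet.
undecided = chosen_proposal([[(1, "x")], [(2, "y")], []], num_acceptors=3)
```

When the result is `None`, the learner simply repeats the LEARN request later, as in the loop above.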
Why is Paxos correct?
The key is “do not break the value once it is chosen”. The proposer follows the
algorithm strictly and makes its proposal based on the collected history
information.
To prove that Paxos is correct, we should prove:
P2b. If a proposal with value v is chosen, then every higher-numbered
proposal issued by any proposer has value v.
This means that once a value is decided, no further action can destroy the decision.
◦ If proposal <m,v> is decided, the prepare phase restricts every later proposer to
proposals with value v, because the proposer must use the value returned by the
acceptors. A proposer can issue a proposal only after getting responses from a
majority of acceptors, which necessarily intersects the majority that accepted
<m,v>.
◦ This is what P2c says.
◦ So Paxos will not destroy a value once it is decided, i.e. it meets the safety
requirement. What about the liveness requirement? Will Paxos actually reach
consensus among a group of processes?
Progress (liveness)
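Even just two proposers can drive the system into livelock without doing
anything useful: each keeps issuing a PREPARE with a higher number before
the other's ACCEPT requests arrive, so no proposal is ever accepted by a
majority.
So there should be only one proposer, which can be considered the
leader of the group.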
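The livelock between two proposers can be reproduced with a hand-picked, adversarial message schedule. This is a minimal sketch: the schedule, the simplified `Acceptor`, and the function `duel` are all hypothetical, and real networks would only hit this interleaving some of the time.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0     # highest PREPARE number answered so far
        self.accepted = None  # last accepted (number, value), if any

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True
        return False

    def accept(self, n, value):
        if n >= self.promised:
            self.promised, self.accepted = n, (n, value)
            return True
        return False


def duel(rounds):
    """Two proposers keep preempting each other; return how many ACCEPTs succeeded."""
    acceptors = [Acceptor() for _ in range(3)]
    number = {"p1": 1, "p2": 2}
    value = {"p1": "A", "p2": "B"}
    for a in acceptors:
        a.prepare(number["p1"])          # p1 finishes its prepare phase first
    successes = 0
    current, other = "p1", "p2"
    for _ in range(rounds):
        # The other proposer's PREPARE (higher number) arrives first ...
        for a in acceptors:
            a.prepare(number[other])
        # ... so the current proposer's ACCEPT is refused by every acceptor.
        acks = sum(a.accept(number[current], value[current]) for a in acceptors)
        if acks >= 2:
            successes += 1
        # The loser retries with a higher number and the roles swap.
        number[current] = number[other] + 1
        current, other = other, current
    return successes, [a.accepted for a in acceptors]


successes, accepted = duel(10)
```

After ten rounds nothing has been accepted, even though no process failed and no message was lost. A single leader avoids this schedule.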
Leader Election(1)
leader; // leader process, initialized to p2, the process with smallest id
proposer_self; // each proposer has its own id
proposer_start_leader_election
  repeat periodically forever
    send(ELECTION) to all proposers
    wait for a while, receiving leader election messages; // “a while” can be 2x(largest latency)
    active_proposers = all proposers that sent back the ACK-ELECTION message
    leader = the proposer in active_proposers with the minimal proposer_id
Leader Election(2)
proposer_on_receive_election(proposer)
  send(ACK-ELECTION, proposer_self) to proposer
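The selection rule of the election, "the active proposer with the minimal id", can be sketched as follows. The sketch assumes integer proposer ids and that a proposer always counts itself as active; both are assumptions, as is the name `elect_leader`.

```python
def elect_leader(self_id, acks):
    """Pick the leader after one election round.

    acks: ids of the proposers whose ACK-ELECTION arrived within the wait window.
    Assumption: the caller counts itself as active even without its own ACK.
    """
    active = set(acks) | {self_id}
    return min(active)


# Proposer 3 heard from proposers 2 and 5: proposer 2 becomes the leader.
leader_a = elect_leader(3, [2, 5])

# Proposer 3 heard from nobody (e.g. all messages were lost): it leads itself.
leader_b = elect_leader(3, [])
```

Two proposers can temporarily disagree on the leader, for example when messages are delayed. That does not threaten safety, since Paxos stays safe with multiple proposers; it only delays progress.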
Leader Election(3, leader code)
current_proposal_number;
proposer_make_proposals
  repeat forever
    wait for a while; // 3x(maximal latency)
    if leader == proposer_self
      stop the existing proposer_make_proposal;
      current_proposal_number = current_proposal_number + np;
      start a new call of proposer_make_proposal;
Discussion
1 Can these two proposals be considered the same: <100, “hello”> and
<200, “hello”>? Consider a group of only three acceptors and use it
to illustrate the principle of Paxos proposals.
2 If everything is OK, what about the performance of Paxos? How can
a bunch of operations be batched?
Thank you! Any Questions?