CS 471 - Lecture 7 Distributed Coordination
George Mason University Fall 2009
Distributed Coordination
• Time in Distributed Systems
• Logical Time/Clock
• Distributed Mutual Exclusion
• Distributed Election
• Distributed Agreement
Time in Distributed Systems
Distributed systems have no global clock.
• Algorithms for clock synchronization are useful for concurrency control based on timestamp ordering and for distributed transactions.
• Physical clocks: clock synchronization algorithms have inherent limitations.
• Logical time is an alternative: it gives the ordering of events.
Time in Distributed Systems
Updating a replicated database can leave it in an inconsistent state. We need a way to ensure that the two updates are performed in the same order at each database.
Clock Synchronization Algorithms
Even clocks on different computers that start out synchronized will typically drift apart over time.
(Figure: the relation between clock time and UTC (Coordinated Universal Time) when clocks tick at different rates.)
Is it possible to synchronize all clocks in a distributed system?
Physical Clock Synchronization
Cristian’s Algorithm and NTP (Network Time Protocol): periodically get information from a time server (assumed to be accurate).
Berkeley algorithm: an active time server uses polling to compute an average time. Note that the goal here is to have correct ‘relative’ time.
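To make the idea concrete, here is a minimal sketch of a Cristian-style client-side estimate. The `request_server_time` callback is an assumed stand-in for the actual query to the time server; all names are illustrative.

```python
import time

def cristian_estimate(request_server_time):
    """Estimate the time server's current clock (a sketch).

    request_server_time() is assumed to query the time server and
    return the server's clock value (one round trip).
    """
    t0 = time.monotonic()                # local time when the request leaves
    server_time = request_server_time()  # server's clock, read mid-flight
    t1 = time.monotonic()                # local time when the reply arrives
    rtt = t1 - t0
    # Assume the reply spent about half the round trip in transit.
    return server_time + rtt / 2.0
```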
Logical Clocks
Observation: It may be sufficient that every node agrees on a current time – that time need not be ‘real’ time.
Taking this one step further, it is often adequate that two systems simply agree on the order in which system events occurred.
Ordering events
In a distributed environment, processes can communicate only by messages.
Event: the occurrence of a single action that a process carries out as it executes, e.g. send, receive, update of internal variables/state.
For e-commerce applications, the events may be ‘client dispatched order message’ or ‘merchant server recorded transaction to log’.
Events at a single process p_i can be placed in a total order by recording their occurrence times.
Different clock drift rates may cause problems in global event ordering.
Logical Time
The order of two events occurring at two different computers cannot be determined based on their “local” time.
Lamport proposed using logical time and logical clocks to infer the order of events (causal ordering) under certain conditions (1978).
The notion of logical time/clock is fairly general and constitutes the basis of many distributed algorithms.
“Happened Before” relation
Lamport defined a “happened before” relation (→) to capture the causal dependencies between events.
• A → B, if A and B are events in the same process and A occurred before B.
• A → B, if A is the event of sending a message m in a process and B is the event of the receipt of the same message m by another process.
• If A → B and B → C, then A → C (the happened-before relation is transitive).
“Happened Before” relation
(Figure: processes p1, p2, p3 against physical time; events a, b at p1; c, d at p2; e, f at p3; message m1 from b to c and message m2 from d to f.)
a → b (at p1); c → d (at p2); b → c; also d → f.
Not all events are related by the “→” relation: it is a partial order. Consider a and e (different processes, and no chain of messages to relate them).
If events are not related by “→”, they are said to be concurrent (written as a || e).
Logical Clocks
In order to implement the “happened-before” relation, introduce a system of logical clocks.
A logical clock is a monotonically increasing software counter. It need not relate to a physical clock.
Each process p_i has a logical clock L_i which can be used to apply logical timestamps to events.
Logical Clocks (Update Rules)
LC1: L_i is incremented by 1 before each internal event at process p_i.
LC2 (applies to send and receive):
• when process p_i sends message m, it increments L_i by 1 and the message is assigned a timestamp t = L_i
• when p_j receives (m, t), it first sets L_j to max(L_j, t) and then increments L_j by 1 before timestamping the event receive(m)
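A minimal Python sketch of rules LC1 and LC2; the (time, pid) timestamp pair anticipates the total-ordering tiebreak discussed a few slides later, and all names are illustrative:

```python
class LamportClock:
    """Logical clock implementing rules LC1 and LC2 (a sketch)."""

    def __init__(self, pid):
        self.pid = pid    # process index, used only to break timestamp ties
        self.time = 0

    def internal_event(self):          # LC1
        self.time += 1
        return self.time

    def send(self):                    # LC2, sender side
        self.time += 1
        return self.time               # timestamp t piggybacked on the message

    def receive(self, t):              # LC2, receiver side
        self.time = max(self.time, t)
        self.time += 1
        return self.time

    def timestamp(self):
        # (time, pid) pairs compare lexicographically, giving the total
        # order in which ties are broken by process index.
        return (self.time, self.pid)
```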
Logical Clocks (Cont.)
(Figure: the same processes with Lamport timestamps: a=1, b=2 at p1; c=3, d=4 at p2; e=1, f=5 at p3; message m1 from b to c, m2 from d to f.)
Each of p1, p2, p3 has its logical clock initialized to zero.
The indicated clock values are those immediately after the events.
For m1, the timestamp 2 is piggybacked and c gets max(0,2)+1 = 3.
Logical Clocks
(a) Three processes, each with its own clock. The clocks run at different rates. (b) As messages are exchanged, the logical clocks are corrected.
Logical Clocks (Cont.)
e → e’ implies L(e) < L(e’).
The converse is not true: L(e) < L(e’) does not imply e → e’. (E.g. L(b) > L(e) but b || e.)
In other words, in Lamport’s system we can guarantee that if L(e) < L(e’) then e’ did not happen before e. But we cannot tell, just by looking at the timestamps, whether e “happened-before” e’ or whether they are concurrent.
Logical Clocks (Cont.)
Lamport’s “happened before” relation defines an irreflexive partial order among the events in the distributed system.
Some applications require that a total order be imposed on all the events.
We can obtain a total order by using clock values at different processes as the criteria for the ordering and breaking the ties by considering process indices when necessary.
Logical Clocks
The positioning of Lamport’s logical clocks in distributed systems.
An Application: Totally-Ordered Multicast
Consider a group of n distributed processes. At any time, m ≤ n processes may be multicasting “update” messages to each other.
• The parameter m is not known to the individual nodes.
How can we devise a solution to guarantee that all the updates are performed in the same order by all the processes, despite variable network latency?
Assumptions
• No messages are lost (reliable delivery)
• Messages from the same sender are received in the order they were sent (FIFO)
• A copy of each message is also sent to the sender
An Application of Totally-Ordered Multicast
Updating a replicated database and leaving it in an inconsistent state.
Totally-Ordered Multicast (Cont.)
Each multicast message is always time-stamped with the current (logical) time of its sender.
When a (kernel) process receives a multicast update request:
• It first puts the message into a local queue instead of directly delivering it to the application (i.e. instead of blindly updating the local database). The local queue is ordered according to the timestamps of the update-request messages.
• It also multicasts an acknowledgement to all other processes (naturally, with a timestamp of higher value).
An Application: Totally-Ordered Multicast (Cont.)
Local Database Update Rule
A process can deliver a queued message to the application it is running (i.e. the local database can be updated) only when that message is at the head of the local queue and has been acknowledged by all other processes.
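A compact sketch of these rules, with message transport abstracted behind an assumed `multicast` callback that reliably FIFO-delivers to every process, including the sender; all names are illustrative:

```python
import heapq

class TOMulticastNode:
    """Totally-ordered multicast via Lamport timestamps (a sketch)."""

    def __init__(self, pid, num_nodes, multicast, apply_update):
        self.pid = pid
        self.n = num_nodes
        self.multicast = multicast          # assumed reliable FIFO multicast
        self.apply_update = apply_update    # deliver to the local database
        self.clock = 0
        self.queue = []                     # heap of (timestamp, sender, update)
        self.acks = {}                      # (timestamp, sender) -> set of ackers

    def request_update(self, update):
        self.clock += 1
        self.multicast(('REQ', self.clock, self.pid, update))

    def on_message(self, msg):
        kind, ts, sender = msg[0], msg[1], msg[2]
        self.clock = max(self.clock, ts) + 1
        if kind == 'REQ':
            heapq.heappush(self.queue, (ts, sender, msg[3]))
            self.acks.setdefault((ts, sender), set())
            # Acknowledge to everyone, with a (necessarily) larger timestamp.
            self.clock += 1
            self.multicast(('ACK', self.clock, self.pid, (ts, sender)))
        else:  # ACK
            self.acks.setdefault(msg[3], set()).add(sender)
        self._try_deliver()

    def _try_deliver(self):
        # Deliver only the head of the queue, once acked by all n nodes.
        while self.queue:
            ts, sender, update = self.queue[0]
            if len(self.acks.get((ts, sender), ())) < self.n:
                return
            heapq.heappop(self.queue)
            del self.acks[(ts, sender)]
            self.apply_update(update)
```

Because ties on (timestamp, sender) are impossible, every queue ends up in the same order at every node, which is exactly what the delivery rule exploits.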
GMU – CS 571 7.22
An Application: Totally-Ordered Multicast (Cont.)
For example: process 1 and process 2 each want to perform an update.
1. Process 1 sends update requests with timestamp t_x to itself and process 2.
2. Process 2 sends update requests with timestamp t_y to itself and process 1.
3. When each process receives the requests, it puts them on its queue in timestamp order. If t_x = t_y, then process 1’s request goes first. NOTE: the queues will be identical.
4. Each sends out acks to the other (with larger timestamps).
5. The same request will be at the front of each queue; once the ack for the request at the front of the queue is received, its update is performed and the request is removed.
Why can’t this scenario happen?
At DB1: received request1 and request2 with timestamps 4 and 5, as well as acks from ALL processes for request1. DB1 performs request1.
At DB2: received request2 with timestamp 5 (but not request1 with timestamp 4) and acks from ALL processes for request2. DB2 performs request2.
(Figure: DB1 issues request1 at timestamp 4; DB2 issues request2 at timestamp 5; DB1’s ack of request2 carries timestamp 6.)
Answer: this violates the FIFO assumption (see the Assumptions for totally-ordered multicast): for DB2 to have received the ack [timestamp > 6] for its own request, request1 must already have been received.
Distributed Mutual Exclusion (DME)
Assumptions
• The system consists of n processes; each process P_i resides at a different processor.
• For the sake of simplicity, we assume that there is only one critical section that requires mutual exclusion.
• Message delivery is reliable.
• Processes do not fail (we will later discuss the implications of relaxing this).
The application-level protocol for executing a critical section proceeds as follows:
• Enter(): enter the critical section (CS) – block if necessary
• ResourceAccess(): access shared resources in the CS
• Exit(): leave the CS – other processes may now enter
DME Requirements
Safety (Mutual Exclusion): At most one process may execute in the critical section at a time.
Bounded-waiting: Requests to enter and exit the critical section eventually succeed.
Evaluating DME Algorithms
Performance criteria:
• The bandwidth consumed, which is proportional to the number of messages sent in each entry and exit operation.
• The synchronization delay necessary between one process exiting the CS and the next process entering it.
  • When evaluating the synchronization delay, think of the scenario in which a process P_a is in the CS, and P_b, which is waiting, is the next process to enter.
  • The maximum delay between P_a’s exit and P_b’s entry is called the “synchronization delay”.
DME: The Centralized Server Algorithm
One of the processes in the system is chosen to coordinate the entry to the critical section.
A process that wants to enter its critical section sends a request message to the coordinator.
The coordinator decides which process can enter the critical section next, and it sends that process a reply message.
When the process receives a reply message from the coordinator, it enters its critical section.
After exiting its critical section, the process sends a release message to the coordinator and proceeds with its execution.
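A sketch of the coordinator’s side of this protocol; the `send` callback is an assumed reliable point-to-point primitive, and requesters are queued FIFO (one reasonable policy, not the only one):

```python
from collections import deque

class Coordinator:
    """Centralized DME coordinator (a sketch)."""

    def __init__(self, send):
        self.send = send          # send(pid, msg): assumed reliable delivery
        self.holder = None        # process currently in the CS
        self.waiting = deque()    # deferred requesters, FIFO

    def on_request(self, pid):
        if self.holder is None:
            self.holder = pid
            self.send(pid, 'reply')       # grant entry immediately
        else:
            self.waiting.append(pid)      # defer: no reply yet

    def on_release(self, pid):
        assert pid == self.holder
        self.holder = self.waiting.popleft() if self.waiting else None
        if self.holder is not None:
            self.send(self.holder, 'reply')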
Mutual Exclusion: A Centralized Algorithm
a) Process 1 asks the coordinator for permission to enter a critical region. Permission is granted.
b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply.
c) When process 1 exits the critical region, it tells the coordinator, which then replies to 2.
DME: The Central Server Algorithm (Cont.)
Safety?
Bounded waiting?
This scheme requires three messages per use of the critical section (request, reply, and release):
• Entering the critical section – even when no process is currently in the CS – takes two messages.
• Exiting the critical section takes one release message.
The synchronization delay for this algorithm is the time taken for a round-trip message.
Problems?
DME: Token-Passing Algorithms
A number of DME algorithms are based on token-passing.
A single token is passed among the nodes.
A node wishing to enter the critical section must possess the token.
Algorithms in this class differ in terms of the logical topology they assume, their run-time message complexity, and their delay.
DME – Ring-Based Algorithm
Idea: Arrange the processes in a logical ring.
The ring topology may be unrelated to the physical interconnections between the underlying computers.
Each process p_i has a communication channel to the next process in the ring, p_(i+1) mod N.
A single token is passed from process to process around the ring in a single direction.
Only the process that holds the token can enter the critical section.
DME – Ring-Based Algorithm (Cont.)
(Figure: processes p1, p2, p3, p4, …, pn arranged in a ring, with the token circulating.)
DME – Ring-Based Algorithm (Cont.)
If a process is not willing to enter the CS when it receives the token, then it immediately forwards it to its neighbor.
A process requesting entry waits until it receives the token, then retains it and enters the CS. To exit the CS, the process sends the token on to its neighbor.
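A sketch of one node’s behavior in this scheme; the `send_to_neighbor` callback standing in for the ring channel is an assumption of the example:

```python
class RingNode:
    """One node in the token-ring DME algorithm (a sketch)."""

    def __init__(self, send_to_neighbor):
        self.send_to_neighbor = send_to_neighbor  # forwards the token on the ring
        self.wants_cs = False
        self.in_cs = False

    def request_cs(self):
        self.wants_cs = True                 # wait for the token to arrive

    def on_token(self):
        if self.wants_cs:
            self.in_cs = True                # retain the token; enter the CS
        else:
            self.send_to_neighbor('token')   # not interested: forward at once

    def exit_cs(self):
        self.in_cs = False
        self.wants_cs = False
        self.send_to_neighbor('token')       # pass the token along
```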
Requirements:
• Safety?
• Bounded-waiting?
DME – Ring-Based Algorithm (Cont.)
Performance evaluation:
• Entering a critical section may require between 0 and N messages.
• Exiting a critical section requires only one message.
• The synchronization delay between one process’ exit from the CS and the next process’ entry is anywhere from 1 to N-1 message transmissions.
Another token-passing algorithm
(Figure: processes p1 … p6 arranged in a line; the token travels in direction 1, then reverses to direction 2 at the end.)
Nodes are arranged as a logical linear tree (a line).
The token is passed from one end to the other through multiple hops.
When the token reaches one end of the line, its direction is reversed.
A node wishing to enter the CS waits for the token; when it receives it, it holds it and enters the CS.
DME – Ricart-Agrawala Algorithm [1981]
A Distributed Mutual Exclusion (DME) Algorithm based on logical clocks
Processes willing to enter a critical section multicast a request message, and can enter only when all other processes have replied to this message.
The algorithm requires that each message carry its logical timestamp. To obtain a total ordering of logical timestamps, ties are broken in favor of processes with smaller indices.
DME – Ricart-Agrawala Algorithm (Cont.)
Requesting the Critical Section
• When a process P_i wants to enter the CS, it sends a timestamped REQUEST message to all processes.
• When a process P_j receives a REQUEST message from process P_i, it sends a REPLY message to P_i if:
  • P_j is neither requesting nor using the CS, or
  • P_j is requesting the CS and P_i’s request timestamp is smaller than P_j’s own request timestamp.
(If neither of these conditions holds, the request is deferred.)
DME – Ricart-Agrawala Algorithm (Cont.)
Executing the Critical Section: Process P i enters the CS after it has received REPLY messages from all other processes.
Releasing the Critical Section: When process P i exits the CS, it sends REPLY messages to all deferred requests.
Observe: A process’ REPLY message is blocked only by processes that are requesting the CS with higher priority (smaller timestamp). When a process sends out REPLY messages to all the deferred requests, the process with the next highest priority request receives the last needed REPLY message and enters the CS.
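These rules fit in a short per-process sketch. Timestamps are (clock, pid) pairs, so comparisons break ties in favor of smaller indices; `send` is an assumed reliable delivery callback and all names are illustrative:

```python
class RANode:
    """Ricart-Agrawala mutual exclusion, one process (a sketch)."""

    def __init__(self, pid, num_procs, send):
        self.pid = pid
        self.n = num_procs
        self.send = send            # send(dest_pid, msg): assumed reliable
        self.clock = 0
        self.state = 'RELEASED'     # RELEASED, WANTED, or HELD
        self.my_request = None      # (clock, pid) of our pending request
        self.replies = 0
        self.deferred = []          # pids whose requests we are holding back

    def request_cs(self):
        self.state = 'WANTED'
        self.clock += 1
        self.my_request = (self.clock, self.pid)
        self.replies = 0
        for j in range(self.n):
            if j != self.pid:
                self.send(j, ('REQUEST', self.my_request))

    def on_request(self, req):
        ts, sender = req
        self.clock = max(self.clock, ts) + 1
        # Defer if we are in the CS, or we are requesting with higher
        # priority (a smaller (clock, pid) pair).
        defer = (self.state == 'HELD' or
                 (self.state == 'WANTED' and self.my_request < req))
        if defer:
            self.deferred.append(sender)
        else:
            self.send(sender, ('REPLY',))

    def on_reply(self):
        self.replies += 1
        if self.state == 'WANTED' and self.replies == self.n - 1:
            self.state = 'HELD'     # all others have replied: enter the CS

    def release_cs(self):
        self.state = 'RELEASED'
        for j in self.deferred:
            self.send(j, ('REPLY',))
        self.deferred = []
```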
GMU – CS 571 7.39
Distributed Mutual Exclusion: Ricart/Agrawala
a) Processes 0 and 2 want to enter the same critical region at the same moment.
b) Process 0 has the lowest timestamp, so it wins.
c) When process 0 is done, it sends an OK as well, so 2 can now enter the critical region.
DME – Ricart-Agrawala Algorithm (Cont.)
The algorithm satisfies the Safety requirement.
• Suppose two processes P_i and P_j enter the CS at the same time; then each must have replied to the other. But since all the timestamps are totally ordered, this is impossible.
Bounded-waiting?
DME – Ricart-Agrawala Algorithm (Cont.)
Gaining CS entry takes 2(N – 1) messages in this algorithm.
The synchronization delay is only one message transmission time.
The performance of the algorithm can be further improved.
Comparison
Algorithm                        Messages per entry/exit   Delay before entry (in message times)   Problems
Centralized                      3                         2                                        Coordinator crash
Distributed (Ricart/Agrawala)    2(n – 1)                  2(n – 1)                                 Crash of any process
Token ring                       1 to ∞                    0 to n – 1                               Lost token, process crash

A comparison of three mutual exclusion algorithms.
Election Algorithms
Many distributed algorithms employ a coordinator process that performs functions needed by the other processes in the system:
• enforcing mutual exclusion
• maintaining a global wait-for graph for deadlock detection
• replacing a lost token
• controlling an input/output device in the system
If the coordinator process fails due to the failure of the site at which it resides, a new coordinator must be selected through an election algorithm.
Election Algorithms (Cont.)
We say that a process calls the election if it takes an action that initiates a particular run of the election algorithm.
• An individual process does not call more than one election at a time.
• N processes could call N concurrent elections.
At any point in time, a process p_i is either a participant or a non-participant in some run of the election algorithm.
The identity of the newly elected coordinator must be unique, even if multiple processes call the election concurrently.
Election Algorithms (Cont.)
Election algorithms assume that a unique priority number Pri(i) is associated with each active process p_i in the system. Larger numbers indicate higher priorities.
Without loss of generality, we require that the elected process be chosen as the one with the largest identifier.
How to determine identifiers?
Ring-Based Election Algorithm
Chang and Roberts (1979)
During the execution of the algorithm, the processes exchange messages on a unidirectional logical ring (assume clockwise communication).
Assumptions:
• no failures occur during the execution of the algorithm
• reliable message delivery
Ring-Based Election Algorithm (Cont.)
Initially, every process is marked as a non-participant in an election.
Any process can begin an election (if, for example, it discovers that the current coordinator has failed).
It proceeds by marking itself as a participant, placing its identifier in an election message, and sending the message to its clockwise neighbor.
Ring-Based Election Algorithm (Cont.)
When a process receives an election message, it compares the identifier in the message with its own.
• If the arrived identifier is greater, it forwards the message to its neighbor. It also marks itself as a participant.
• If the arrived identifier is smaller and the receiver is not a participant, it substitutes its own identifier in the election message and forwards it. It also marks itself as a participant.
• If the arrived identifier is smaller and the receiver is a participant, it does not forward the message.
• If the arrived identifier is that of the receiver itself, then this process’ identifier must be the largest, and it becomes the coordinator.
Ring-Based Election Algorithm (Cont.)
The coordinator marks itself as a non-participant once more and sends an elected message to its neighbor, announcing its election and enclosing its identity.
When a process p_i receives an elected message, it marks itself as a non-participant, sets its variable coordinator-id to the identifier in the message, and, unless it is the new coordinator, forwards the message to its neighbor.
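The full set of rules fits in a small per-process sketch; `send_next`, an assumed callback delivering to the clockwise neighbor, is the only transport primitive:

```python
class RingElectionNode:
    """Chang-Roberts ring election, one process (a sketch)."""

    def __init__(self, my_id, send_next):
        self.my_id = my_id
        self.send_next = send_next   # deliver a message to the clockwise neighbor
        self.participant = False
        self.coordinator_id = None

    def start_election(self):
        self.participant = True
        self.send_next(('election', self.my_id))

    def on_election(self, cand_id):
        if cand_id > self.my_id:
            self.participant = True
            self.send_next(('election', cand_id))          # forward unchanged
        elif cand_id < self.my_id and not self.participant:
            self.participant = True
            self.send_next(('election', self.my_id))       # substitute own id
        elif cand_id == self.my_id:
            # Our id went all the way around: it must be the largest.
            self.participant = False
            self.coordinator_id = self.my_id
            self.send_next(('elected', self.my_id))
        # cand_id < my_id and already a participant: swallow the message.

    def on_elected(self, coord_id):
        self.coordinator_id = coord_id
        self.participant = False
        if coord_id != self.my_id:
            self.send_next(('elected', coord_id))
```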
Ring-Based Election Algorithm (Example)
(Figure: a ring of processes with identifiers 3, 17, 4, 24, 9, 1, 15, 28; an election message carrying identifier 24 is in transit.)
Note: The election was started by process 17.
The highest process identifier encountered so far is 24. Participant processes are shown darkened
Ring-Based Election Algorithm (Cont.)
Observe:
• Only one coordinator is selected, and all the processes agree on its identity.
• The non-participant and participant states are used so that messages arising when another process starts another election at the same time are extinguished as soon as possible.
If only a single process starts an election, the worst-performing case is when its anti-clockwise neighbor has the highest identifier (this requires 3N – 1 messages).
The Bully Algorithm
Garcia-Molina (1982).
Unlike the ring-based algorithm:
• It allows crashes during the execution of the algorithm.
• It assumes that each process knows which processes have higher identifiers, and that it can communicate with all such processes directly.
• It assumes that the system is synchronous (it uses timeouts to detect process failures).
• Reliable message delivery is assumed.
The Bully Algorithm (Cont.)
A process begins an election when it notices, through timeouts, that the coordinator has failed.
Three types of messages
• An election message is sent to announce an election.
• An answer message is sent in response to an election message.
• A coordinator message is sent to announce the identity of the elected process (the “new coordinator”).
The Bully Algorithm (Cont.)
How can we construct a reliable failure detector?
There is a maximum message transmission delay T_trans and a maximum message processing delay T_process.
If a process does not receive a reply within T = 2*T_trans + T_process, it can infer that the intended recipient has failed.
An election will then be needed.
The Bully Algorithm (Cont.)
The process that knows it has the highest identifier can elect itself as the coordinator simply by sending a coordinator message to all processes with lower identifiers.
A process with a lower identifier begins an election by sending an election message to those processes that have a higher identifier, and awaits an answer message in response.
• If none arrives within time T, the process considers itself the coordinator and sends a coordinator message to all processes with lower identifiers.
• If a reply arrives, the process waits a further period T’ for a coordinator message to arrive from the new coordinator. If none arrives, it begins another election.
The Bully Algorithm (Cont.)
If a process receives an election message, it sends back an answer message and begins another election (unless it has begun one already).
If a process receives a coordinator message, it sets its variable coordinator-id to the identifier of the coordinator contained within it.
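A sketch of a process’s behavior under these rules. Timeouts are abstracted into a hypothetical one-shot `after(delay, fn)` timer, `send` is an assumed reliable point-to-point primitive, and T / T_PRIME correspond to T and T’ above:

```python
class BullyNode:
    """Bully election, one process (a sketch)."""

    T, T_PRIME = 2.0, 2.0   # illustrative timeout values

    def __init__(self, my_id, all_ids, send, after):
        self.my_id, self.all_ids = my_id, all_ids
        self.send, self.after = send, after
        self.coordinator_id = None
        self.got_answer = False
        self.electing = False

    def start_election(self):
        if self.electing:
            return                        # at most one election at a time
        higher = [p for p in self.all_ids if p > self.my_id]
        if not higher:
            self._announce()              # we hold the highest id: win now
            return
        self.electing, self.got_answer = True, False
        self.coordinator_id = None
        for p in higher:
            self.send(p, ('election', self.my_id))
        self.after(self.T, self._election_timeout)

    def _election_timeout(self):
        self.electing = False
        if not self.got_answer:
            self._announce()              # no higher process is alive
        else:
            # A higher process answered: wait T' for its coordinator
            # message, and start over if it never arrives.
            self.after(self.T_PRIME, lambda: (
                None if self.coordinator_id is not None
                else self.start_election()))

    def _announce(self):
        self.coordinator_id = self.my_id
        for p in self.all_ids:
            if p < self.my_id:
                self.send(p, ('coordinator', self.my_id))

    def on_election(self, sender_id):
        self.send(sender_id, ('answer', self.my_id))
        self.start_election()             # begin our own election if needed

    def on_answer(self, sender_id):
        self.got_answer = True

    def on_coordinator(self, coord_id):
        self.coordinator_id = coord_id
```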
The Bully Algorithm: Example
(Figure, stage 1: the coordinator P4 (C) has failed; P1 sends election messages; P2 and P3 send answers.)
• P1 starts the election once it notices that P4 has failed.
• Both P2 and P3 answer.
The election of coordinator p2, after the failure of p4 and then p3.
The Bully Algorithm: Example
(Figure, stage 2: P2 and P3 each send election messages to the higher-numbered processes; P3 answers P2.)
• P1 waits, since it knows it can’t be the coordinator.
• Both P2 and P3 start elections.
• P3 immediately answers P2, but needs to wait on P4.
• Once P4 times out, P3 knows it can be coordinator, but…
The election of coordinator p2, after the failure of p4 and then p3.
The Bully Algorithm (Example)
(Figure, stage 3: P1 times out waiting for a coordinator message.)
• If P3 fails before sending the coordinator message, P1 will eventually start a new election, since it hasn’t heard about a new coordinator.
The Bully Algorithm (Example)
Eventually.....
(Figure: P2, now the coordinator C, sends a coordinator message to P1.)
The Bully Algorithm (Cont.)
What happens if a crashed process recovers and immediately initiates an election?
If it has the highest process identifier (for example P4 in previous slide), then it will decide that it is the coordinator and may choose to announce this to other processes.
• It will become the coordinator, even though the current coordinator is functioning (hence the name “bully”).
• This may take place concurrently with the sending of a coordinator message by another process which has previously detected the crash.
• Since there are no guarantees on message delivery order, the recipients of these messages may reach different conclusions regarding the id of the coordinator process.
The Bully Algorithm (Cont.)
Similarly, if the timeout values are inaccurate (that is, if the failure detector is unreliable), then a process with large identifier but slow response may cause problems.
Algorithm’s performance:
• Best case: the process with the second-largest identifier notices the coordinator’s failure: N – 2 messages.
• Worst case: the process with the smallest identifier notices the failure: O(N^2) messages.
Election in Wireless environments (1)
Traditional election algorithms assume that communication is reliable and that topology does not change.
This is typically not the case in many wireless environments.
Vasudevan [2004]: elect the ‘best’ node in ad hoc networks.
Election in Wireless environments (1)
1. To elect a leader, any node in the network can start an election by sending an ELECTION message to all nodes in its range.
2. When a node receives an ELECTION message for the first time, it chooses the sender as its parent and sends the message to all nodes in its range except the parent.
3. When a node later receives additional ELECTION messages from a non-parent, it merely acknowledges the message.
4. Once a node has received acknowledgements from all neighbors except the parent, it sends an acknowledgement to the parent. This acknowledgement contains information about the best leader candidate, based on the resource information of its neighbors.
5. Eventually the node that started the election receives all this information and uses it to decide on the best leader; this decision can then be passed back to all nodes.
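A simplified sketch of the idea, run over a known adjacency map instead of real radio messages; the graph, the `capacity` scores, and all names are illustrative assumptions:

```python
def elect_best(graph, capacity, source):
    """Flood a spanning tree from `source` and report back the best node.

    graph:    dict node -> list of neighbors (the radio-range relation)
    capacity: dict node -> numeric 'goodness' (battery, CPU, ...)
    Returns the highest-capacity node reachable from source.
    """
    visited = {source}

    def probe(node):
        # Best candidate in the subtree rooted at `node`: the value each
        # child would report to its parent in its acknowledgement.
        best = node
        for nbr in graph[node]:
            if nbr not in visited:
                visited.add(nbr)          # nbr adopts `node` as its parent
                sub = probe(nbr)
                if capacity[sub] > capacity[best]:
                    best = sub
        return best

    return probe(source)

# Example: node 'a' starts the election.
g = {'a': ['b', 'c'], 'b': ['a', 'd'], 'c': ['a'], 'd': ['b']}
cap = {'a': 1, 'b': 5, 'c': 2, 'd': 9}
print(elect_best(g, cap, 'a'))  # -> 'd'
```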
Elections in Wireless Environments (2)
Election algorithm in a wireless network, with node a as the source. (a) Initial network. (b) The build-tree phase
Elections in Wireless Environments (3)
The build-tree phase
Elections in Wireless Environments (4)
The build-tree phase and reporting of best node to source.
Reaching Agreement
There are applications where a set of processes wish to agree on a common “value”.
Such agreement may not take place due to:
• Unreliable communication medium
• Faulty processes
Processes may send garbled or incorrect messages to other processes.
A subset of the processes may collaborate with each other in an attempt to defeat the scheme.
Agreement with Unreliable Communications
Process P_i at site A sends a message to process P_j at site B.
To proceed, P_i needs to know whether P_j has received the message.
P_i can detect transmission failures using a time-out scheme.
Agreement with Unreliable Communications
Suppose that P_j also needs to know that P_i has received its acknowledgement message, in order to decide how to proceed.
Example: the two-army problem. Two armies need to coordinate their action (“attack” or “do not attack”) against the enemy.
• If only one of them attacks, defeat is certain.
• If both of them attack simultaneously, they will win.
• The communication medium is unreliable, so they use acknowledgements.
• One army sends an “attack” message to the other. Can it initiate the attack?
Agreement with Unreliable Communication
(Figure: Blue Army #1 and Blue Army #2 communicate via a messenger who must cross territory held by the Red Army.)
The two-army problem
Agreement with Unreliable Communications
In the presence of an unreliable communication medium, two parties can never reach an agreement, no matter how many acknowledgements they send.
• Assume that there is some agreement protocol that terminates in a finite number of steps. Remove any extra steps at the end to obtain the minimum-length protocol that works.
• Some message is now the last one, and it is essential to the agreement. However, the sender of this last message cannot know whether the other party received it.
It is not possible in a distributed environment for processes P_i and P_j to agree completely on their respective states with 100% certainty in the face of unreliable communication, even with non-faulty processes.
Reaching Agreement with Faulty Processes
Many things can go wrong…
Communication:
• Message transmission can be unreliable.
• The time taken to deliver a message is unbounded.
• An adversary can intercept messages.
Processes:
• Can fail, or team up to produce wrong results.
Agreement is very hard, sometimes impossible, to achieve!
Agreement in Faulty Systems - 5
A system of N processes, where each process i will provide a value v_i to each of the others. Some number of these processes may be incorrect (or malicious).
Goal: each process learns the true values sent by each of the correct processes.
(Figure: the Byzantine agreement problem for three nonfaulty processes and one faulty process.)
Byzantine Agreement Problem
Three or more generals are to agree to attack or retreat.
Each of them issues a vote. Every general decides to attack or retreat based on the votes.
But one or more generals may be “faulty”: may supply wrong information to different peers at different times
Devise an algorithm to make sure that:
• The “correct” generals should agree on the same decision at the end (attack or retreat).
Lamport, Shostak, Pease. The Byzantine Generals Problem. ACM TOPLAS 4(3), July 1982, 382-401.
Impossibility Results
(Figure: two three-general configurations. In each, the generals exchange “attack”/“retreat” votes; a single traitor sends conflicting votes, so the two loyal generals cannot agree.)
No solution for three processes can handle a single traitor.
In a system with m faulty processes, agreement can be achieved only if 2m+1 processes (more than two-thirds of the total) are functioning correctly.
Byzantine General Problem: Oral Messages Algorithm
(Figure: P1 sends its value 1 to P2, P3 and P4.)
Phase 1: Each process sends its value (troop strength) to the other processes. Correct processes send the same (correct) value to all. Faulty processes may send different values to each if desired (or no message).
Assumptions: 1) every message that is sent is delivered correctly; 2) the receiver of a message knows who sent it; 3) the absence of a message can be detected.
Byzantine General Problem
Phase 1: Generals announce their troop strengths to each other
(Figure: P2 sends its value 2 to P1, P3 and P4.)
Byzantine General Problem
Phase 1: Generals announce their troop strengths to each other
(Figure: P4 sends its value 4 to P1, P2 and P3.)
Byzantine General Problem
Phase 2: Each process uses the messages to create a vector of responses; there must be a default value for missing messages.
Each general constructs a vector with all troop strengths:
(Figure: P1’s vector is (1, 2, x, 4); P2’s is (1, 2, y, 4); P4’s is (1, 2, z, 4), where x, y, z are whatever values the faulty P3 sent to each.)
Byzantine General Problem
Phase 3: Each process sends its vector to all other processes.
Phase 4: Each process uses the information received from every other process to do its computation.
(Figure: each correct process collects the vectors sent in phase 3. For example, P2 receives (1, 2, y, 4) from P1, an arbitrary vector (a, b, c, d) from the faulty P3, and (1, 2, z, 4) from P4. Taking the element-wise majority, P1, P2 and P4 each obtain (1, 2, ?, 4): the true values of the correct processes, with no majority for P3’s entry.)
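A toy sketch of the phase-4 computation at one correct process: element-wise majority over its own vector and the vectors received in phase 3, with '?' marking a position that has no strict majority (all names are illustrative):

```python
from collections import Counter

UNKNOWN = '?'   # default when no majority exists for a position

def byzantine_decide(own_vector, received_vectors):
    """Phase 4 at one correct process (a sketch).

    own_vector:       the vector this process built in phase 2
    received_vectors: the vectors other processes sent in phase 3
    Returns the element-wise majority vector.
    """
    all_vectors = [own_vector] + received_vectors
    result = []
    for position in range(len(own_vector)):
        counts = Counter(v[position] for v in all_vectors)
        value, freq = counts.most_common(1)[0]
        # A strict majority is required; otherwise the value is unknown.
        result.append(value if freq > len(all_vectors) // 2 else UNKNOWN)
    return tuple(result)

# P2's view from the slides: the faulty P3 sent (a, b, c, d).
own = (1, 2, 'y', 4)
recv = [('a', 'b', 'c', 'd'), (1, 2, 'z', 4)]
print(byzantine_decide(own, recv))   # -> (1, 2, '?', 4)
```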
Byzantine General Problem
A correct algorithm can be devised only if n ≥ 3m + 1.
At least m+1 rounds of message exchange are needed (Fischer, 1982).
Note: this result only guarantees that each process receives the true values sent by the correct processors; it does not identify the correct processes!
Byzantine Agreement Algorithm (signed messages)
Adds two additional assumptions:
(1) A loyal general’s signature cannot be forged, and any alteration of the contents of a signed message can be detected.
(2) Anyone can verify the authenticity of a general’s signature.
Algorithm SM(m):
1. The general signs and sends his value to every lieutenant.
2. For each i:
   a. If lieutenant i receives a message of the form v:0 from the commander and has not yet received any order, then he lets V_i equal {v} and sends v:0:i to every other lieutenant.
   b. If lieutenant i receives a message of the form v:0:j1:…:jk and v is not in the set V_i, then he adds v to V_i and, if k < m, sends the message v:0:j1:…:jk:i to every lieutenant other than j1, …, jk.
3. For each i: when lieutenant i will receive no more messages, he obeys the order in choice(V_i).
Algorithm SM(m) solves the Byzantine General’s problem if there are at most m traitors.
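A toy sketch of the SM(m) receive rule at lieutenant i. The signature chain is simulated by a plain `signers` list; a real implementation would use cryptographic signatures as assumed above, and all names are illustrative:

```python
def on_signed_message(i, value, signers, V, m, lieutenants, relay):
    """SM(m) receive rule at lieutenant i (a sketch).

    value:       the order v
    signers:     [0, j1, ..., jk], the chain that signed the message
    V:           the set of order values lieutenant i has seen so far
    lieutenants: ids of all lieutenants
    relay:       relay(dest, value, signers) re-signs and forwards
    """
    if value in V:
        return                       # order already recorded: ignore
    V.add(value)
    k = len(signers) - 1             # lieutenant signatures so far
    if k < m:
        for dest in lieutenants:
            if dest != i and dest not in signers:
                relay(dest, value, signers + [i])

def choice(V):
    # Obey the order only if exactly one value was received.
    return next(iter(V)) if len(V) == 1 else 'retreat'
```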
Signed messages
(Figure, left: a loyal general sends attack:0 to both lieutenants; Lieutenant 1 relays attack:0:1 to Lieutenant 2, so both obey “attack”.
Figure, right: a traitorous general sends attack:0 to Lieutenant 1 and retreat:0 to Lieutenant 2; the relayed messages attack:0:1 and retreat:0:2 expose the conflicting signed orders.)
SM(1) with one traitor.
Global State (1)
The ability to extract and reason about the global state of a distributed application has several important applications:
• distributed garbage collection
• distributed deadlock detection
• distributed termination detection
• distributed debugging
While it is possible to examine the state of an individual process, getting a global state is problematic.
Q: Is it possible to assemble a global state from local states in the absence of a global clock?
Global State (2)
Consider a system S of N processes p_i (i = 1, 2, …, N). The local state of a process p_i is a sequence of events:
history(p_i) = h_i = <e_i^0, e_i^1, e_i^2, …>
• h_i^k = <e_i^0, e_i^1, …, e_i^k> is a finite prefix of the history.
The idea is to see if there is some way to form a global history:
H = h_1 ∪ h_2 ∪ … ∪ h_N
This is difficult because we can’t choose just any prefixes to use to form this history.
Global State (3)
A cut of the system’s execution is a subset of its global history that is a union of prefixes of process histories:
C = h_1^c1 ∪ h_2^c2 ∪ … ∪ h_N^cN
The state s_i of p_i in the global state S corresponding to the cut C is the state immediately after the last event processed by p_i in the cut, e_i^ci. The set of events {e_i^ci : i = 1, 2, …, N} is the frontier of the cut.
(Figure: events e_1^0, e_1^1, e_1^2 at p1, e_2^0, e_2^1 at p2, and e_3^0, e_3^1, e_3^2 at p3, with a frontier drawn across the three processes.)
Global State (4)
A cut of a system can be inconsistent if it contains the receipt of a message that hasn’t been sent (in the cut).
A cut is consistent if, for each event it contains, it also contains all the events that happened-before (→) that event:
for all events e ∈ C: f → e ⇒ f ∈ C
A consistent global state is one that corresponds to a consistent cut.
Global State (5)
The goal of Chandy & Lamport’s ‘snapshot’ algorithm is to record the state of each of a set of processes in such a way that, even though the combination of recorded states may never have actually occurred, the recorded global state is consistent.
The algorithm assumes that:
• Neither channels nor processes fail, and all messages are delivered exactly once.
• Channels are unidirectional and provide FIFO-ordered message delivery.
• There is a distinct channel between any two processes that communicate.
• Any process may initiate a global snapshot at any time.
• Processes may continue to execute and send and receive normal messages while the snapshot is taking place.
Global State (6)
a) Organization of a process and channels for a distributed snapshot. A special marker message is used to signal the need for a snapshot.
Global State (7)
Marker receiving rule for process p_i:
On p_i’s receipt of a marker message over channel c:
  if p_i has not yet recorded its state, it
    records its process state;
    records the state of channel c as the empty set;
    turns on recording of messages arriving over all other incoming channels;
  else
    p_i records the state of c as the set of messages it has received over c since it saved its state.
Marker sending rule for process p_i:
After p_i has recorded its state, for each outgoing channel c:
  p_i sends one marker message over c (before it sends any other message over c).
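These two rules translate almost directly into code. In this sketch, channels are abstract ids and `send` / `capture_state` are assumed primitives:

```python
class SnapshotProcess:
    """Chandy-Lamport snapshot participant (a sketch).

    incoming / outgoing are channel ids; send(channel, msg) is an
    assumed reliable, FIFO, unidirectional channel primitive, and
    capture_state() returns this process's local state.
    """

    def __init__(self, incoming, outgoing, send, capture_state):
        self.incoming, self.outgoing = incoming, outgoing
        self.send, self.capture_state = send, capture_state
        self.recorded_state = None
        self.channel_state = {}    # channel -> messages recorded on it
        self.recording = set()     # channels currently being recorded

    def _record_and_propagate(self):
        self.recorded_state = self.capture_state()
        self.recording = set(self.incoming)
        for c in self.outgoing:                 # marker sending rule
            self.send(c, 'MARKER')

    def initiate_snapshot(self):
        self._record_and_propagate()

    def on_marker(self, channel):               # marker receiving rule
        if self.recorded_state is None:
            self._record_and_propagate()
            self.channel_state[channel] = []    # first marker: channel is empty
        else:
            # Channel state = messages seen on it since we saved our state.
            self.channel_state.setdefault(channel, [])
        self.recording.discard(channel)         # stop recording this channel

    def on_message(self, channel, msg):
        # Application messages arriving while a channel is being recorded
        # become part of that channel's recorded state.
        if channel in self.recording:
            self.channel_state.setdefault(channel, []).append(msg)
```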
Global State (8)
b) Process Q receives a marker for the first time and records its local state. It then sends a marker on all of its outgoing channels.
c) Q records all incoming messages.
d) Once Q has received a marker on each of its incoming channels, it finishes recording the state of those channels.