
Distributed Systems:
Shared Data
November 2005
Distributed systems: shared data
1
Overview of chapters
• Introduction
• Co-ordination models and languages
• General services
• Distributed algorithms
• Shared data
– Ch 13 Transactions and concurrency control, 13.1-13.4
– Ch 14 Distributed transactions
– Ch 15 Replication
November 2005
Distributed systems: shared data
2
Overview
• Transactions and locks
• Distributed transactions
• Replication
November 2005
Distributed systems: shared data
3
Overview
• Transactions
• Nested transactions
• Locks
Known material!
• Distributed transactions
• Replication
November 2005
Distributed systems: shared data
4
Transactions: Introduction
• Environment
– data partitioned over different servers on
different systems
– sequence of operations as individual unit
– long-lived data at servers (cfr. Databases)
• transactions = approach to achieve consistency
of data in a distributed environment
November 2005
Distributed systems: shared data
5
Transactions: Introduction
• Example
Person 1: Withdraw(A, 100); Deposit(B, 100);
Person 2: Withdraw(C, 200); Deposit(B, 200);
(accounts A, B and C held at the bank's servers)
November 2005
Distributed systems: shared data
6
Transactions: Introduction
• Critical section
– group of instructions = indivisible block w.r.t. other critical sections
– short duration
• atomic operation (within a server)
– operation is free of interference from operations being
performed on behalf of other (concurrent) clients
– concurrency in server → multiple threads
– atomic operation ≠ critical section
• transaction
November 2005
Distributed systems: shared data
7
Transactions: Introduction
• Critical section
• atomic operation
• transaction
– group of different operations + properties
– single transaction may contain operations on different
servers
– possibly long duration
ACID properties
November 2005
Distributed systems: shared data
8
Transactions: ACID
• Properties concerning the sequence of operations
that read or modify shared data:
Atomicity
Consistency
Isolation
Durability
November 2005
Distributed systems: shared data
9
Transactions: ACID
• Atomicity or the “all-or-nothing” property
– a transaction
• commits = completes successfully or
• aborts = has no effect at all
– the effect of a committed transaction
• is guaranteed to persist
• can be made visible to other transactions
– transaction aborts can be initiated by
• the system (e.g. when a node fails) or
• a user issuing an abort command
November 2005
Distributed systems: shared data
10
Transactions: ACID
• Consistency
– a transaction moves data from one consistent state to
another
• Isolation
– no interference from other transactions
– intermediate effects invisible to other transactions
The isolation property has 2 parts:
– serializability: running concurrent transactions has the
same effect as some serial ordering of the transactions
– Failure isolation: a transaction cannot see the
uncommitted effects of another transaction
November 2005
Distributed systems: shared data
11
Transactions: ACID
• Durability
– once a transaction commits, the effects of the
transaction are preserved despite subsequent failures
November 2005
Distributed systems: shared data
12
Transactions: Life histories
• Transactional service operations
– OpenTransaction() → Trans
• starts new transaction
• returns unique identifier for transaction
– CloseTransaction(Trans) → (Commit, Abort)
• ends transaction
• returns Commit if the transaction committed, else Abort
– AbortTransaction(Trans)
• aborts transaction
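For concreteness, a minimal Python sketch of this client-facing interface (hypothetical names and structure, not the course code):

    import uuid

    class TransactionService:
        def __init__(self):
            self.active = {}                   # TransID -> buffered (tentative) operations

        def open_transaction(self):
            trans = uuid.uuid4().hex           # unique identifier for the new transaction
            self.active[trans] = []
            return trans

        def close_transaction(self, trans):
            self.active.pop(trans, None)       # a real server would make the effects permanent here
            return "Commit"                    # or "Abort" if the server had to abort the transaction

        def abort_transaction(self, trans):
            self.active.pop(trans, None)       # discard all tentative effects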
November 2005
Distributed systems: shared data
13
Transactions: Life histories
• History 1: success
T := OpenTransaction();
operation;
operation;
….
operation;
CloseTransaction(T);
(operations have read or write semantics)
November 2005
Distributed systems: shared data
14
Transactions: Life histories
• History 2: abort by client
T := OpenTransaction();
operation;
operation;
….
operation;
AbortTransaction(T);
November 2005
Distributed systems: shared data
15
Transactions: Life histories
• History 3: abort by server
T := OpenTransaction();
operation;
operation;
Server aborts!
….
operation;
Error reported
November 2005
Distributed systems: shared data
16
Transactions: Concurrency
• Illustration of well known problems:
– the lost update problem
– inconsistent retrievals
• operations used + implementations
– Withdraw(A, n)
b := A.read();
A.write( b - n);
– Deposit(A, n)
b := A.read();
A.write( b + n);
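As an illustration (my own sketch, not from the slides): running these read/write implementations from two concurrent threads without any concurrency control can lose one of the deposits.

    import threading

    balance = {"B": 200}

    def deposit(account, n):
        b = balance[account]                   # read
        balance[account] = b + n               # write: another thread may have written in between

    t = threading.Thread(target=deposit, args=("B", 4))
    u = threading.Thread(target=deposit, args=("B", 3))
    t.start(); u.start(); t.join(); u.join()
    print(balance["B"])                        # 207 if serialized; 203 or 204 if an update is lost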
November 2005
Distributed systems: shared data
17
Transactions: Concurrency
• The lost update problem:
Transaction T: Withdraw(A,4); Deposit(B,4);
Transaction U: Withdraw(C,3); Deposit(B,3);
Interleaved execution of operations on B → ?
November 2005
Distributed systems: shared data
18
Transactions: Concurrency
• The lost update problem:
Transaction T A  B: 4
Transaction U C  B: 3
bt := A.read();
A.write(bt-4);
A: 100
B: 200
C: 300
November 2005
Distributed systems: shared data
19
Transactions: Concurrency
• The lost update problem:
Transaction T A  B: 4
Transaction U C  B: 3
bt := A.read();
A.write(bt-4);
A: 96
bu := C.read();
B: 200
C.write(bu-3);
C: 300
November 2005
Distributed systems: shared data
20
Transactions: Concurrency
• The lost update problem:
Transaction T A  B: 4
Transaction U C  B: 3
bt := A.read();
A.write(bt-4);
A: 96
bu := C.read();
B: 200
C.write(bu-3);
C: 297
bu := B.read();
bt := B.read();
bt=200
B.write(bu+3);
November 2005
Distributed systems: shared data
21
Transactions: Concurrency
• The lost update problem:
Transaction T A  B: 4
Transaction U C  B: 3
bt := A.read();
A.write(bt-4);
A: 96
bu := C.read();
B: 203
C.write(bu-3);
C: 297
bu := B.read();
bt := B.read();
bt=200
B.write(bu+3);
B.write(bt+4);
November 2005
Distributed systems: shared data
22
Transactions: Concurrency
• The lost update problem:
Transaction T (A → B: 4)   Transaction U (C → B: 3)
T: bt := A.read();
T: A.write(bt-4);      → A: 96
U: bu := C.read();
U: C.write(bu-3);      → C: 297
U: bu := B.read();     (bu = 200)
T: bt := B.read();     (bt = 200)
U: B.write(bu+3);      → B: 203
T: B.write(bt+4);      → B: 204  (U's update of B is lost)
Correct B = 207!!
November 2005
Distributed systems: shared data
23
Transactions: Concurrency
• The inconsistent retrieval problem:
Transaction T: Withdraw(A,50); Deposit(B,50);
Transaction U: BranchTotal();
November 2005
Distributed systems: shared data
24
Transactions: Concurrency
• The inconsistent retrieval problem :
Transaction T A  B: 50
Transaction U BranchTotal
bt := A.read();
A.write(bt-50);
A: 100
B: 200
C: 300
November 2005
Distributed systems: shared data
25
Transactions: Concurrency
• The inconsistent retrieval problem :
Transaction T A  B: 50
Transaction U BranchTotal
bt := A.read();
A.write(bt-50);
A: 50
bu := A.read();
B: 200
bu := bu + B. read();
bu := bu + C.read();
C: 300
bt := B.read();
550
B.write(bt+50);
November 2005
Distributed systems: shared data
26
Transactions: Concurrency
• The inconsistent retrieval problem:
Transaction T (A → B: 50)   Transaction U (BranchTotal)
T: bt := A.read();
T: A.write(bt-50);        → A: 50
U: bu := A.read();        (bu = 50)
U: bu := bu + B.read();   (bu = 250)
U: bu := bu + C.read();   (bu = 550)
T: bt := B.read();
T: B.write(bt+50);        → B: 250
U reports a branch total of 550; Correct total: 600
November 2005
Distributed systems: shared data
27
Transactions: Concurrency
• Illustration of well known problems:
– the lost update problem
– inconsistent retrievals
• elements of solution
– execute all transactions serially?
• no concurrency → unacceptable
– execute transactions in such a way that the overall
execution is equivalent to some serial execution
• sufficient? Yes
• how? Concurrency control
November 2005
Distributed systems: shared data
28
Transactions: Concurrency
• The lost update problem: serially equivalent interleaving
Transaction T A  B: 4
Transaction U C  B: 3
bt := A.read();
A.write(bt-4);
A: 100
B: 200
C: 300
November 2005
Distributed systems: shared data
29
Transactions: Concurrency
• The lost update problem: serially equivalent interleaving
Transaction T A  B: 4
Transaction U C  B: 3
bt := A.read();
A.write(bt-4);
A: 96
bu := C.read();
B: 200
C.write(bu-3);
C: 300
November 2005
Distributed systems: shared data
30
Transactions: Concurrency
• The lost update problem: serially equivalent interleaving
Transaction T A  B: 4
Transaction U C  B: 3
bt := A.read();
A.write(bt-4);
A: 96
bu := C.read();
B: 200
C.write(bu-3);
bt := B.read();
B.write(bt+4);
November 2005
C: 297
Distributed systems: shared data
31
Transactions: Concurrency
• The lost update problem: serially equivalent interleaving
Transaction T A  B: 4
Transaction U C  B: 3
bt := A.read();
A.write(bt-4);
A: 96
bu := C.read();
B: 204
C.write(bu-3);
C: 297
bu := B.read();
bt := B.read();
B.write(bt+4);
B.write(bu+3);
November 2005
Distributed systems: shared data
32
Transactions: Concurrency
• The lost update problem: serially equivalent interleaving
Transaction T (A → B: 4)   Transaction U (C → B: 3)
T: bt := A.read();
T: A.write(bt-4);      → A: 96
U: bu := C.read();
U: C.write(bu-3);      → C: 297
T: bt := B.read();
T: B.write(bt+4);      → B: 204
U: bu := B.read();
U: B.write(bu+3);      → B: 207
November 2005
Distributed systems: shared data
33
Transactions: Recovery
• Illustration of well known problems:
– a dirty read
– premature write
• operations used + implementations
– Withdraw(A, n)
b := A.read();
A.write( b - n);
– Deposit(A, n)
b := A.read();
A.write( b + n);
November 2005
Distributed systems: shared data
34
Transactions: Recovery
• A dirty read problem:
Transaction T: Deposit(A,4);
Transaction U: Deposit(A,3);
Interleaved execution and abort → ?
November 2005
Distributed systems: shared data
35
Transactions: Recovery
• A dirty read problem:
Transaction T 4  A
Transaction U 3 A
bt := A.read();
A.write(bt+4);
November 2005
A: 100
Distributed systems: shared data
36
Transactions: Recovery
• A dirty read problem:
Transaction T 4  A
Transaction U 3 A
bt := A.read();
A.write(bt+4);
A: 104
bu := A.read();
A.write(bu+3);
November 2005
Distributed systems: shared data
37
Transactions: Recovery
• A dirty read problem:
Transaction T (deposit 4 to A)   Transaction U (deposit 3 to A)
T: bt := A.read();
T: A.write(bt+4);      → A: 104
U: bu := A.read();     (dirty read: bu = 104)
U: A.write(bu+3);      → A: 107
U: Commit
T: Abort
Correct result: A = 103
November 2005
Distributed systems: shared data
38
Transactions: Recovery
• Premature write or
over-writing uncommitted values:
Transaction T: Deposit(A,4);
Transaction U: Deposit(A,3);
Interleaved execution and abort → ?
November 2005
Distributed systems: shared data
39
Transactions: Recovery
• Over-writing uncommitted values :
Transaction T 4  A
Transaction U 3 A
bt := A.read();
A.write(bt+4);
November 2005
A: 100
Distributed systems: shared data
40
Transactions: Recovery
• Over-writing uncommitted values :
Transaction T 4  A
Transaction U 3 A
bt := A.read();
A.write(bt+4);
A: 104
bu := A.read();
A.write(bu+3);
November 2005
Distributed systems: shared data
41
Transactions: Recovery
• Over-writing uncommitted values :
Transaction T (deposit 4 to A)   Transaction U (deposit 3 to A)
T: bt := A.read();
T: A.write(bt+4);      → A: 104
U: bu := A.read();
U: A.write(bu+3);      → A: 107
U: Abort
Correct result: A = 104 (T's uncommitted write must survive U's abort)
November 2005
Distributed systems: shared data
42
Transactions: Recovery
• Illustration of well known problems:
– a dirty read
– premature write
• elements of solution:
– Cascading Aborts: a transaction reading uncommitted
data must be aborted if the transaction that modified
the data aborts
– to avoid cascading aborts, transactions can only read
data written by committed transactions
– undo of write operations must be possible
November 2005
Distributed systems: shared data
43
Transactions: Recovery
• how to preserve data despite subsequent failures?
– usually by using stable storage
• two copies of data stored
– in separate parts of disks
– not decay related (probability of both parts corrupted is small)
November 2005
Distributed systems: shared data
44
Nested Transactions
• Transactions composed of several
sub-transactions
• Why nesting?
– Modular approach to structuring transactions in
applications
– means of controlling concurrency within a transaction
• concurrent sub-transactions accessing shared data are
serialized
– a finer grained recovery from failures
• sub-transactions fail independently
November 2005
Distributed systems: shared data
45
Nested Transactions
T = Transfer
T1 = Deposit
T2 = Withdraw
• Sub-transactions commit or abort independently
– without effect on outcome of other sub-transactions or
enclosing transactions
• effect of sub-transaction becomes durable only
when top-level transaction commits
November 2005
Distributed systems: shared data
46
Concurrency control: locking
• Environment
– shared data in a single server (this section)
– many competing clients
• problem:
– realize transactions
– maximize concurrency
• solution: serial equivalence
• difference with mutual exclusion?
November 2005
Distributed systems: shared data
47
Concurrency control: locking
• Protocols:
– Locks
– Optimistic Concurrency Control
– Timestamp Ordering
November 2005
Distributed systems: shared data
48
Concurrency control: locking
• Example:
– access to shared data within a transaction
→ lock (= data reserved for …)
– exclusive locks
• exclude access by other transactions
November 2005
Distributed systems: shared data
49
Concurrency control: locking
• Same example (lost update) with locking
Transaction T: Withdraw(A,4); Deposit(B,4);
Transaction U: Withdraw(C,3); Deposit(B,3);
Colour of data shows the owner of the lock
November 2005
Distributed systems: shared data
50
Concurrency control: locking
• Exclusive locks
Transaction T A  B: 4
bt := A.read();
Transaction U C  B: 3
A: 100
B: 200
C: 300
November 2005
Distributed systems: shared data
51
Concurrency control: locking
• Exclusive locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 100
B: 200
C: 300
November 2005
Distributed systems: shared data
52
Concurrency control: locking
• Exclusive locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 200
C: 300
November 2005
Distributed systems: shared data
53
Concurrency control: locking
• Exclusive locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 200
C.write(bu-3);
C: 300
November 2005
Distributed systems: shared data
54
Concurrency control: locking
• Exclusive locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 200
C.write(bu-3);
bu := B.read();
C: 297
November 2005
Distributed systems: shared data
55
Concurrency control: locking
• Exclusive locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 200
bt := B.read();
C: 297
C.write(bu-3);
Wait for T
bu := B.read();
B.write(bt+4);
November 2005
Distributed systems: shared data
56
Concurrency control: locking
• Exclusive locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 204
bt := B.read();
C: 297
C.write(bu-3);
Wait for T
bu := B.read();
B.write(bt+4);
CloseTransaction(T);
November 2005
Distributed systems: shared data
57
Concurrency control: locking
• Exclusive locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 204
bt := B.read();
C: 297
C.write(bu-3);
Wait for T
bu := B.read();
B.write(bt+4);
CloseTransaction(T);
November 2005
Distributed systems: shared data
58
Concurrency control: locking
• Exclusive locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 204
C.write(bu-3);
bt := B.read();
C: 297
bu := B.read();
B.write(bt+4);
CloseTransaction(T);
November 2005
B.write(bu+3);
Distributed systems: shared data
59
Concurrency control: locking
• Exclusive locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 207
C.write(bu-3);
bt := B.read();
C: 297
bu := B.read();
B.write(bt+4);
CloseTransaction(T);
November 2005
B.write(bu+3);
CloseTransaction(U);
Distributed systems: shared data
60
Concurrency control: locking
• Exclusive locks
Transaction T (A → B: 4)   Transaction U (C → B: 3)
T: bt := A.read();         (T locks A)
T: A.write(bt-4);          → A: 96
U: bu := C.read();         (U locks C)
U: C.write(bu-3);          → C: 297
T: bt := B.read();         (T locks B)
U: bu := B.read();         ... waits for T's lock on B
T: B.write(bt+4);          → B: 204
T: CloseTransaction(T);    (T's locks released)
U: bu := B.read();         (U locks B)
U: B.write(bu+3);          → B: 207
U: CloseTransaction(U);
November 2005
Distributed systems: shared data
61
Concurrency control: locking
• Basic elements of protocol
1 serial equivalence
• requirements
– all of a transaction’s accesses to a particular data item
should be serialized with respect to accesses by other
transactions
– all pairs of conflicting operations of 2 transactions should
be executed in the same order
• how?
– A transaction is not allowed any new locks after it has
released a lock
→ Two-phase locking
November 2005
Distributed systems: shared data
62
Concurrency control: locking
• Two-phase locking
– Growing Phase
• new locks can be acquired
– Shrinking Phase
• no new locks
• locks are released
November 2005
Distributed systems: shared data
63
Concurrency control: locking
• Basic elements of protocol
1 serial equivalence → two-phase locking
2 hide intermediate results
• conflict between
– release of lock → access by other transactions possible
– access should be delayed till commit/abort of transaction
• how?
– new mechanism?
– (better) release of locks only at commit/abort
→ strict two-phase locking
– locks held till end of transaction
November 2005
Distributed systems: shared data
64
Concurrency control: locking
• How increase concurrency and preserve
serial equivalence?
– Granularity of locks
– Appropriate locking rules
November 2005
Distributed systems: shared data
65
Concurrency control: locking
• Granularity of locks
– observations
• large number of data items on server
• typical transaction needs only a few items
• conflicts unlikely
– large granularity
→ limits concurrent access
• example: all accounts in a branch of a bank are
locked together
– small granularity
→ overhead
November 2005
Distributed systems: shared data
66
Concurrency control: locking
• Appropriate locking rules
– when conflicts?
operation by T   operation by U   conflict?
read             read             No
read             write            Yes
write            write            Yes
→ Read & Write locks
November 2005
Distributed systems: shared data
67
Concurrency control: locking
• Lock compatibility
(for one data item)
                        Lock requested
Lock already set        Read        Write
None                    OK          OK
Read                    OK          Wait
Write                   Wait        Wait
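The same compatibility table as a lookup, in a short Python sketch (an illustration, not the course code):

    COMPATIBLE = {   # (lock already set, lock requested) -> may proceed?
        ("none",  "read"): True,   ("none",  "write"): True,
        ("read",  "read"): True,   ("read",  "write"): False,
        ("write", "read"): False,  ("write", "write"): False,
    }

    def must_wait(lock_already_set, lock_requested):
        return not COMPATIBLE[(lock_already_set, lock_requested)]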
November 2005
Distributed systems: shared data
68
Concurrency control: locking
• Strict two-phase locking
– locking
• done by server (containing data item)
– unlocking
• done by commit/abort of the transactional service
November 2005
Distributed systems: shared data
69
Concurrency control: locking
• Use of locks on strict two-phase locking
– when an operation accesses a data item
• not locked yet
→ lock set & operation proceeds
• conflicting lock set by another transaction
→ transaction must wait till ...
• non-conflicting lock set by another transaction
→ lock shared & operation proceeds
• locked by same transaction
→ lock promoted if necessary & operation proceeds
November 2005
Distributed systems: shared data
70
Concurrency control: locking
• Use of locks on strict two-phase locking
– when an operation accesses a data item
– when a transaction is committed/aborted
→ server unlocks all data items locked for the
transaction
November 2005
Distributed systems: shared data
71
Concurrency control: locking
• Lock implementation
– lock manager
– managing table of locks:
• transaction identifiers
• identifier of the (locked) data item
• lock type
• condition variable
– for waiting transactions
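As an illustration, a minimal lock-manager sketch along these lines (assumed structure and names, not the course implementation); it applies the sharing/promotion rules of strict two-phase locking and releases locks only at commit/abort:

    import threading

    class LockManager:
        def __init__(self):
            self.cond = threading.Condition()
            self.locks = {}                            # data item -> (lock type, set of owner TransIDs)

        def acquire(self, trans, item, kind):          # kind is "read" or "write"
            with self.cond:
                while True:
                    held = self.locks.get(item)
                    if held is None:                   # not locked yet -> set lock, proceed
                        self.locks[item] = (kind, {trans})
                        return
                    held_kind, owners = held
                    if owners == {trans}:              # locked by same transaction -> promote if necessary
                        if kind == "write":
                            self.locks[item] = ("write", owners)
                        return
                    if held_kind == "read" and kind == "read":
                        owners.add(trans)              # non-conflicting lock -> share, proceed
                        return
                    self.cond.wait()                   # conflicting lock -> wait until it is released

        def release_all(self, trans):                  # called only at commit/abort (strict 2PL)
            with self.cond:
                for item in list(self.locks):
                    kind, owners = self.locks[item]
                    owners.discard(trans)
                    if not owners:
                        del self.locks[item]
                self.cond.notify_all()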
November 2005
Distributed systems: shared data
72
Concurrency control: locking
• Deadlocks
– a state in which each member of a group of
transactions is waiting for some other member
to release a lock
no progress possible!
– Example: with read/write locks
November 2005
Distributed systems: shared data
73
Concurrency control: locking
• Same example (lost update) with locking
Transaction T: Withdraw(A,4); Deposit(B,4);
Transaction U: Withdraw(C,3); Deposit(B,3);
Colour of data shows the owner of the lock
November 2005
Distributed systems: shared data
74
Concurrency control: locking
• Read/write locks
Transaction T A  B: 4
bt := A.read();
Transaction U C  B: 3
A: 100
B: 200
C: 300
November 2005
Distributed systems: shared data
75
Concurrency control: locking
• Read/write locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 100
B: 200
C: 300
November 2005
Distributed systems: shared data
76
Concurrency control: locking
• Read/write locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 200
C: 300
November 2005
Distributed systems: shared data
77
Concurrency control: locking
• Read/write locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 200
C.write(bu-3);
C: 300
November 2005
Distributed systems: shared data
78
Concurrency control: locking
• Read/write locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 200
C.write(bu-3);
bu := B.read();
C: 297
November 2005
Distributed systems: shared data
79
Concurrency control: locking
• Read/write locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 200
C.write(bu-3);
bt := B.read();
C: 297
November 2005
Distributed systems: shared data
bu := B.read();
80
Concurrency control: locking
• Read/write locks
Transaction T (A → B: 4)   Transaction U (C → B: 3)
T: bt := A.read();
T: A.write(bt-4);          → A: 96
U: bu := C.read();
U: C.write(bu-3);          → C: 297
T: bt := B.read();         (T gets a read lock on B)
U: bu := B.read();         (U shares the read lock on B)
T: B.write(bt+4);          ... waits for release by U
U: B.write(bu+3);          ... waits for release by T
Deadlock!!
November 2005
Distributed systems: shared data
81
Concurrency control: locking
• Solutions to the Deadlock problem
– Prevention
• by locking all data items used by a transaction when
it starts
• by requesting locks on data items in a predefined
order
Evaluation
• impossible for interactive transactions
• reduction of concurrency
November 2005
Distributed systems: shared data
82
Concurrency control: locking
• Solutions to the Deadlock problem
– Detection
• the server keeps track of a wait-for graph
– lock: edge is added
– unlock: edge is removed
• the presence of cycles may be checked
– when an edge is added
– periodically
– example
November 2005
Distributed systems: shared data
83
Concurrency control: locking
• Read/write locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 200
C.write(bu-3);
bt := B.read();
C: 297
November 2005
Distributed systems: shared data
bu := B.read();
84
Concurrency control: locking
• Wait-for graph
(figure: wait-for graph state so far — A held by T, C held by U, B held by both T and U; no transaction is waiting yet)
November 2005
Distributed systems: shared data
85
Concurrency control: locking
• Read/write locks
Transaction T A  B: 4
bt := A.read();
A.write(bt-4);
Transaction U C  B: 3
A: 96
bu := C.read();
B: 204
C.write(bu-3);
bt := B.read();
Wait for release by U
C: 297
bu := B.read();
Wait for release by T
B.write(bt+4);
B.write(bu+3);
November 2005
Distributed systems: shared data
86
Concurrency control: locking
• Wait-for graph
(figure: edges added to the wait-for graph: T waits for U and U waits for T, both on data item B)
November 2005
Distributed systems: shared data
87
Concurrency control: locking
• Wait-for graph
(figure: the same wait-for graph, with the cycle between T and U highlighted)
November 2005
Distributed systems: shared data
88
Concurrency control: locking
• Wait-for graph
(figure: simplified graph with only the transactions: T → U → T via data item B)
Cycle → deadlock
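A small sketch of the cycle check a server could run on its local wait-for graph (assuming, for simplicity, that each blocked transaction waits for exactly one other):

    def in_cycle(wait_for, start):
        # wait_for: map "blocked transaction -> transaction it waits for"
        seen, node = set(), start
        while node in wait_for:
            node = wait_for[node]
            if node == start:
                return True                    # followed the edges back to the start: deadlock
            if node in seen:
                return False                   # joined a chain that does not return to start
            seen.add(node)
        return False

    print(in_cycle({"T": "U", "U": "T"}, "T"))  # True: T waits for U and U waits for T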
November 2005
Distributed systems: shared data
89
Concurrency control: locking
• Solutions to the Deadlock problem
– Detection
• the server keeps track of a wait-for graph
• the presence of cycles must be checked
• once a deadlock detected, the server must select a
transaction and abort it (to break the cycle)
• choice of transaction? Important factors
– age of transaction
– number of cycles the transaction is involved in
November 2005
Distributed systems: shared data
90
Concurrency control: locking
• Solutions to the Deadlock problem
– Timeouts
• locks granted for a limited period of time
– within period: lock invulnerable
– after period: lock vulnerable
November 2005
Distributed systems: shared data
91
Overview
• Transactions
• Distributed transactions
– Flat and nested distributed transactions
– Atomic commit protocols
– Concurrency in distributed transactions
– Distributed deadlocks
– Transaction recovery
• Replication
November 2005
Distributed systems: shared data
92
Distributed transactions
• Definition
Any transaction whose activities involve
multiple servers
• Examples
– simple: client accesses several servers
– nested: server accesses several other servers
November 2005
Distributed systems: shared data
93
Distributed transactions
• Examples: simple
(figure: a client performs requests on servers X, Y and Z)
Serial execution of requests on different servers
November 2005
Distributed systems: shared data
94
Distributed transactions
• Examples: nesting
(figure: the client's transaction T runs at server Z; T invokes T1 at X and T2 at Y;
T1 invokes T11 at M and T12 at N; T2 invokes T21 at N and T22 at P)
Serial or parallel execution of requests on different servers
November 2005
Distributed systems: shared data
95
Distributed transactions
• Examples:
(figure: the same nested transaction tree, showing the client, the top-level
transaction T at Z, and subtransactions T1 at X, T2 at Y, T11 at M, T12 and T21
at N, and T22 at P)
November 2005
Distributed systems: shared data
96
Distributed transactions
• Commit: agreement between all servers involved
• to commit
• to abort
• take one server as coordinator
→ simple (?) protocol
– single point of failure?
• tasks of the coordinator
– keep track of other servers, called workers
– responsible for final decision
November 2005
Distributed systems: shared data
97
Distributed transactions
• New service operations:
– AddServer( TransID, CoordinatorID)
• called by clients
• first operation on server that has not joined the
transaction yet
– NewServer( TransID, WorkerID)
• called by new server on the coordinator
• coordinator records ServerID of the worker in its
workers list
November 2005
Distributed systems: shared data
98
Distributed transactions
coordinator
• Examples: simple
client
1. T := X$OpenTransaction();
2. X$Withdraw(A,4);
A
X
T := OpenTransaction();
B
X$Withdraw(A,4);
Y
Z$Deposit(C,4);
Y$Withdraw(B,3);
C,D
Z$Deposit(D,3);
Z
CloseTransaction(T);
November 2005
Distributed systems: shared data
99
Distributed transactions
coordinator
• Examples: simple
A
X
4. X$NewServer(T, Z);
client
T := OpenTransaction();
B
X$Withdraw(A,4);
Y
Z$Deposit(C,4);
3. Z$AddServer(T, X)
Y$Withdraw(B,3);
5. Z$Deposit(C,4);
C,D
Z$Deposit(D,3);
Z
CloseTransaction(T);
November 2005
Distributed systems: shared data
worker
100
Distributed transactions
coordinator
• Examples: simple
A
X
7. X$NewServer(T, Y);
client
T := OpenTransaction();
6. Y$AddServer(T, X)
X$Withdraw(A,4);
8. Y$Withdraw(B,3);
B
Y
Z$Deposit(C,4);
worker
Y$Withdraw(B,3);
C,D
Z$Deposit(D,3);
Z
CloseTransaction(T);
November 2005
Distributed systems: shared data
worker
101
Distributed transactions
coordinator
• Examples: simple
A
X
client
T := OpenTransaction();
B
X$Withdraw(A,4);
Y
Z$Deposit(C,4);
Y$Withdraw(B,3);
9. Z$Deposit(D, 3);
C,D
Z$Deposit(D,3);
Z
CloseTransaction(T);
November 2005
worker
Distributed systems: shared data
worker
102
Distributed transactions
coordinator
• Examples: simple
A
X
10. X$CloseTransaction(T);
client
T := OpenTransaction();
B
X$Withdraw(A,4);
Y
Z$Deposit(C,4);
worker
Y$Withdraw(B,3);
C,D
Z$Deposit(D,3);
Z
CloseTransaction(T);
November 2005
Distributed systems: shared data
worker
103
Distributed transactions
• Examples: data at servers
(figure: X is coordinator and holds account A; workers Y and Z hold B and C, D)
Server   Trans   Role     Coord.   Workers
X        T       Coord    (here)   Y, Z
Y        T       Worker   X
Z        T       Worker   X
November 2005
Distributed systems: shared data
104
Overview
• Transactions
• Distributed transactions
– Flat and nested distributed transactions
– Atomic commit protocols
– Concurrency in distributed transactions
– Distributed deadlocks
– Transaction recovery
• Replication
November 2005
Distributed systems: shared data
105
Atomic Commit protocol
• Elements of the protocol
– each server is allowed to abort its part of a transaction
– if a server votes to commit it must ensure that it will
eventually be able to carry out this commitment
• the transaction must be in the prepared state
• all altered data items must be on permanent storage
– if any server votes to abort, then the decision must be
to abort the transaction
November 2005
Distributed systems: shared data
106
Atomic Commit protocol
• Elements of the protocol (cont.)
– the protocol must work correctly, even when
• some servers fail
• messages are lost
• servers are temporarily unable to communicate
November 2005
Distributed systems: shared data
107
Atomic Commit protocol
• Protocol:
– Phase 1: voting phase
– Phase 2: completion according to outcome of vote
November 2005
Distributed systems: shared data
108
Atomic Commit protocol
• Protocol
Coordinator                                    Worker
step  status                                   step  status
1     prepared to commit    -- CanCommit? -->
                                               2     prepared to commit
                            <-- Yes --
3     committed
      (counting votes)      -- DoCommit -->
                                               4     committed
                            <-- HaveCommitted --
      done
November 2005
Distributed systems: shared data
109
Atomic Commit protocol
• Protocol: Phase 1 voting phase
– Coordinator: for operation CloseTransaction
• sends CanCommit to each worker
• behaves as worker in phase 1
• waits for replies from workers
– Worker: when receiving CanCommit
• if the worker can commit its part of the transaction
– saves data items
– sends Yes to coordinator
• if the worker cannot commit its part of the transaction
– sends No to coordinator
– clears data structures, removes locks
November 2005
Distributed systems: shared data
110
Atomic Commit protocol
• Protocol: Phase 2
– Coordinator: collecting votes
Point of decision!!
• all votes Yes:
commit transaction; send DoCommit to workers
• one vote No:
abort transaction
– Worker: voted yes, waits for decision of coordinator
• receives DoCommit
– makes committed data available; removes locks
• receives AbortTransaction
– clears data structures; removes locks
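Put together, a compressed sketch of the coordinator's side of the protocol (message names as on the slides; send and receive_vote are assumed transport helpers, and timeouts/failures are omitted):

    def coordinator_close_transaction(trans, workers, send, receive_vote):
        # Phase 1: voting
        for w in workers:
            send(w, "CanCommit?", trans)
        votes = [receive_vote(w, trans) for w in workers]     # each reply is "Yes" or "No"

        # Phase 2: completion according to the outcome of the vote (point of decision)
        if all(v == "Yes" for v in votes):
            for w in workers:
                send(w, "DoCommit", trans)
            return "Commit"
        for w in workers:
            send(w, "AbortTransaction", trans)
        return "Abort"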
November 2005
Distributed systems: shared data
111
Atomic Commit protocol
• Timeouts:
– worker did all/some operations and waits for
CanCommit
• unilateral abort possible
– coordinator waits for votes of workers
• unilateral abort possible
– worker voted Yes and waits for final decision
of coordinator
• wait unavoidable
• extensive delay possible
• additional operation GetDecision can be used to get
decision from coordinator or other workers
November 2005
Distributed systems: shared data
112
Atomic Commit protocol
• Performance:
– C → W: CanCommit        N-1 messages
– W → C: Yes/No           N-1 messages
– C → W: DoCommit         N-1 messages
– W → C: HaveCommitted    N-1 messages
+ (unavoidable) delays possible
November 2005
Distributed systems: shared data
113
Atomic Commit protocol
• Nested Transactions
– top level transaction & subtransactions
→ transaction tree
November 2005
Distributed systems: shared data
114
Atomic Commit protocol
T11
M
T12
X
client
T
T21
T1
T2
N
T22
Y
P
Z
November 2005
Distributed systems: shared data
115
Atomic Commit protocol
• Nested Transactions
– top level transaction & subtransactions
→ transaction tree
– coordinator = top level transaction
– subtransaction identifiers
• globally unique
• allow derivation of ancestor transactions
(why necessary?)
November 2005
Distributed systems: shared data
116
Atomic Commit protocol
• Nested Transactions: Transaction IDs
TID in example   actual TID
T                Z, nz
T1               Z, nz ; X, nx
T11              Z, nz ; X, nx ; M, nm
T2               Z, nz ; Y, ny
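One possible representation (an assumption for illustration, not the course code): the identifier is the list of (server, local counter) pairs on the path from the top-level transaction, so the ancestors needed above are simply its prefixes.

    T11 = [("Z", "nz"), ("X", "nx"), ("M", "nm")]     # actual TID of T11 from the table above

    def ancestors(tid):
        return [tid[:i] for i in range(1, len(tid))]  # prefixes = identifiers of T, T1, ...

    print(ancestors(T11))   # [[('Z', 'nz')], [('Z', 'nz'), ('X', 'nx')]]  i.e. T and T1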
November 2005
Distributed systems: shared data
117
Atomic Commit protocol
• Upon completion of a subtransaction
– independent decision to commit or abort
– commit of subtransaction
• only provisionally
• status (including status of descendants) reported to
parent
• final outcome dependent on its ancestors
– abort of subtransaction
• implies abort of all its descendants
• abort reported to its parent (always possible?)
November 2005
Distributed systems: shared data
118
Atomic Commit protocol
• Data structures
– commit list: list of all committed
(sub)transactions
– aborts list: list of all aborted (sub)transactions
– example
November 2005
Distributed systems: shared data
119
Atomic Commit protocol
• Data structures: example
T1     provisional commit (at X)
T11    abort (at M)
T12    provisional commit (at N)
T2     aborted (at Y)
T21    provisional commit (at N)
T22    provisional commit (at P)
November 2005
Distributed systems: shared data
120
Atomic Commit protocol
• Data structures: example
T11
T1
M
X
T12
T
T21
Z
N
T2
T22
Y
P
November 2005
Distributed systems: shared data
121
Atomic Commit protocol
• Data structures: example
Server
Trans
Z
T
Child
Trans
Commit
List
Abort
List
X
Y
M
N
P
November 2005
Distributed systems: shared data
122
Atomic Commit protocol
• Data structures: example
T11
T1
M
X
T12
T
T21
Z
N
T2
T22
Y
P
November 2005
Distributed systems: shared data
123
Atomic Commit protocol
• Data structures: example
Server
Trans
Z
T
X
T1
Child
Trans
T1
Commit
List
Abort
List
Y
M
N
P
November 2005
Distributed systems: shared data
124
Atomic Commit protocol
• Data structures: example
T11
T1
M
X
T12
T
T21
Z
N
T2
T22
Y
P
November 2005
Distributed systems: shared data
125
Atomic Commit protocol
• Data structures: example
Server
Trans
Z
T
Child
Trans
T1
X
T1
T11
Commit
List
Abort
List
Y
M
T11
N
P
November 2005
Distributed systems: shared data
126
Atomic Commit protocol
• Data structures: example
T11
T1
abort
M
X
T12
T
T21
Z
N
T2
T22
Y
P
November 2005
Distributed systems: shared data
127
Atomic Commit protocol
• Data structures: example
Server
Trans
Z
T
Child
Trans
T1
X
T1
T11
Commit
List
Abort
List
T11
Y
M
T11
T11
N
P
November 2005
Distributed systems: shared data
128
Atomic Commit protocol
• Data structures: example
T11
T1
abort
M
X
T12
T
T21
Z
N
T2
T22
Y
P
November 2005
Distributed systems: shared data
129
Atomic Commit protocol
• Data structures: example
Server
Trans
Z
T
Child
Trans
T1
X
T1
T11 ,T12
Commit
List
Abort
List
T11
Y
M
T11
N
T12
T11
P
November 2005
Distributed systems: shared data
130
Atomic Commit protocol
• Data structures: example
T11
T1
abort
M
X
T12
commit
T
T21
Z
N
T2
T22
Y
P
November 2005
Distributed systems: shared data
131
Atomic Commit protocol
• Data structures: example
Server
Trans
Commit
List
Abort
List
T
Child
Trans
T1
Z
X
T1
T11 ,T12
T12
T11
Y
M
T11
N
T12
T11
T12
P
November 2005
Distributed systems: shared data
132
Atomic Commit protocol
• Data structures: example
T11
T1
X
abort
M
T12
commit
commit
T
T21
Z
N
T2
T22
Y
P
November 2005
Distributed systems: shared data
133
Atomic Commit protocol
• Data structures: example
Server
Trans
Commit
List
Abort
List
T
Child
Trans
T1
Z
X
T1
T11 ,T12
T12 , T1
T11
Y
M
T11
N
T12
T11
T12
P
November 2005
Distributed systems: shared data
134
Atomic Commit protocol
• Data structures: example
Server
Trans
T
Child
Trans
T1
Commit
List
T12 , T1
Abort
List
T11
Z
X
T1
T11 ,T12
T12 , T1
T11
Y
M
T11
N
T12
T11
T12
P
November 2005
Distributed systems: shared data
135
Atomic Commit protocol
• Data structures: example
T11
T1
X
abort
M
T12
commit
commit
T
T21
Z
N
T2
T22
Y
P
November 2005
Distributed systems: shared data
136
Atomic Commit protocol
• Data structures: example
Server
Trans
T
Child
Trans
T1 ,T2
Commit
List
T12 , T1
Abort
List
T11
Z
X
T1
T11 ,T12
T12 , T1
T11
Y
T2
M
T11
N
T12
T11
T12
P
November 2005
Distributed systems: shared data
137
Atomic Commit protocol
• Data structures: example
T11
T1
X
abort
M
T12
commit
commit
T
T21
Z
N
T2
T22
Y
P
November 2005
Distributed systems: shared data
138
Atomic Commit protocol
• Data structures: example
Server
Trans
T
Child
Trans
T1 ,T2
Commit
List
T12 , T1
Abort
List
T11
Z
X
T1
T11 ,T12
T12 , T1
T11
Y
T2
T21
M
T11
N
T12 ,T21
T11
T12
P
November 2005
Distributed systems: shared data
139
Atomic Commit protocol
• Data structures: example
T11
T1
X
abort
M
commit
T12
commit
T21
commit
T
Z
N
T2
T22
Y
P
November 2005
Distributed systems: shared data
140
Atomic Commit protocol
• Data structures: example
Server
Trans
T
Child
Trans
T1 ,T2
Commit
List
T12 , T1
Abort
List
T11
Z
X
T1
T11 ,T12
T12 , T1
T11
Y
T2
T21
T21
M
T11
N
T12 ,T21
T11
T12 ,T21
P
November 2005
Distributed systems: shared data
141
Atomic Commit protocol
• Data structures: example
T11
T1
X
abort
M
commit
T12
commit
T21
commit
T
Z
N
T2
T22
Y
P
November 2005
Distributed systems: shared data
142
Atomic Commit protocol
• Data structures: example
Server
Z
T
Child
Trans
T1 ,T2
X
T1
T11 ,T12
T12 , T1
Y
T2
T21 ,T22
T21
M
T11
N
T12 ,T21
P
T22
November 2005
Trans
Commit
List
T12 , T1
Abort
List
T11
T11
T11
T12 ,T21
Distributed systems: shared data
143
Atomic Commit protocol
• Data structures: example
T11
T1
X
abort
M
commit
T12
commit
T21
commit
T
Z
N
T2
T22
Y
commit
P
November 2005
Distributed systems: shared data
144
Atomic Commit protocol
• Data structures: example
Server
Trans
T
Child
Trans
T1 ,T2
Commit
List
T12 , T1
Abort
List
T11
Z
X
T1
T11 ,T12
T12 , T1
T11
Y
T2
T21 ,T22
T21 ,T22
M
T11
N
T12 ,T21
T12 ,T21
P
T22
T22
November 2005
T11
Distributed systems: shared data
145
Atomic Commit protocol
• Data structures: example
T11
T1
X
abort
M
commit
T12
commit
T21
commit
T
Z
N
T2
T22
Y
abort
commit
P
November 2005
Distributed systems: shared data
146
Atomic Commit protocol
• Data structures: example
Server
Trans
T
Child
Trans
T1 ,T2
Commit
List
T12 , T1
Abort
List
T11
Z
X
T1
T11 ,T12
T12 , T1
T11
Y
T2
T21 ,T22
T21 ,T22
T2
M
T11
N
T12 ,T21
T12 ,T21
P
T22
T22
November 2005
T11
Distributed systems: shared data
147
Atomic Commit protocol
• Data structures: example
Server
Trans
T
Child
Trans
T1 ,T2
Commit
List
T12 , T1
Abort
List
T11 ,T2
Z
X
T1
T11 ,T12
T12 , T1
T11
Y
T2
T21 ,T22
T21 ,T22
T2
M
T11
N
T12 ,T21
T12 ,T21
P
T22
T22
November 2005
T11
Distributed systems: shared data
148
Atomic Commit protocol
• Data structures: final data
Server   Trans       Child Trans   Commit List   Abort List
Z        T           T1, T2        T12, T1       T11, T2    (workers with provisional commits: N, X)
X        T1          T11, T12      T12, T1       T11
N        T12, T21                  T12, T21
P        T22                       T22
Y, M     (hold only aborted parts of T)
November 2005
Distributed systems: shared data
149
Atomic Commit protocol
• Algorithm of coordinator
(flat protocol)
– Phase 1
• send CanCommit to each worker in the commit list, passing
– TransactionId: T (the top-level transaction)
– the abort list
• coordinator behaves as worker
– Phase 2
(as for non-nested transactions)
• all votes Yes:
commit transaction; send DoCommit to workers
• one vote No:
abort transaction
November 2005
Distributed systems: shared data
150
Atomic Commit protocol
• Algorithm of worker
(flat protocol)
– Phase 1 (after receipt of CanCommit)
• at least one (provisionally) committed descendant
of top level transaction:
– transactions with ancestors in abort list are aborted
– prepare for commit of other transactions
– send Yes to coordinator
• no (provisionally) committed descendant
– send No to coordinator
– Phase 2
November 2005
(as for non-nested transactions)
Distributed systems: shared data
151
Atomic Commit protocol
• Algorithm of worker (flat protocol)
– Phase 1 (after receipt of CanCommit)
– Phase 2 voted yes, waits for decision of coordinator
• receives DoCommit
– makes committed data available; removes locks
• receives AbortTransaction
– clears data structures; removes locks
November 2005
Distributed systems: shared data
152
Atomic Commit protocol
• Timeouts:
– same 3 as above:
• worker did all/some operations and waits for
CanCommit
• coordinator waits for votes of workers
• worker voted Yes and waits for final decision of
coordinator
– provisionally committed child with an aborted
ancestor:
• does not participate in algorithm
• has to make an enquiry itself
• when?
November 2005
Distributed systems: shared data
153
Atomic Commit protocol
• Data structures: final data
Server   Trans       Child Trans   Commit List   Abort List
Z        T           T1, T2        T12, T1       T11, T2    (workers with provisional commits: N, X)
X        T1          T11, T12      T12, T1       T11
N        T12, T21                  T12, T21
P        T22                       T22
Y, M     (hold only aborted parts of T)
November 2005
Distributed systems: shared data
154
Atomic Commit protocol
• Data structures: example
T11
T1
X
abort
M
commit
T12
commit
T21
commit
T
Z
N
T2
T22
Y
abort
commit
P
November 2005
Distributed systems: shared data
155
Overview
• Transactions
• Distributed transactions
– Flat and nested distributed transactions
– Atomic commit protocols
– Concurrency in distributed transactions
– Distributed deadlocks
– Transaction recovery
• Replication
November 2005
Distributed systems: shared data
156
Distributed transactions
Locking
• Locks are maintained locally (at each server)
– it decides whether
• to grant a lock
• to make the requesting transaction wait
– it cannot release the lock until it knows
whether the transaction has been
• committed
• aborted
at all servers
– deadlocks can occur
November 2005
Distributed systems: shared data
157
Distributed transactions
Locking
• Locking rules for nested transactions
– child transaction inherits locks from parents
– when a nested transaction commits, its locks are
inherited by its parents
– when a nested transaction aborts, its locks are removed
– a nested transaction can get a read lock when all the
holders of write locks (on that data item) are ancestors
– a nested transaction can get a write lock when all the
holders of read and write locks (on that data item) are
ancestors
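The last two rules can be sketched as follows (hypothetical helper names; the holder arguments are sets of transactions):

    def can_read(requester, write_lock_holders, is_ancestor):
        # read lock allowed iff every holder of a write lock is an ancestor of the requester
        return all(is_ancestor(h, requester) for h in write_lock_holders)

    def can_write(requester, read_lock_holders, write_lock_holders, is_ancestor):
        # write lock allowed iff every holder of a read or write lock is an ancestor
        return all(is_ancestor(h, requester)
                   for h in read_lock_holders | write_lock_holders)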
November 2005
Distributed systems: shared data
158
Distributed transactions
Locking
• Who can access A?
T11
T1
A
M
X
T12
T
T21
Z
N
T2
T22
Y
P
November 2005
Distributed systems: shared data
159
Overview
• Transactions
• Distributed transactions
– Flat and nested distributed transactions
– Atomic commit protocols
– Concurrency in distributed transactions
– Distributed deadlocks
– Transaction recovery
• Replication
November 2005
Distributed systems: shared data
160
Distributed deadlocks
• Single server approaches
– prevention: difficult to apply
– timeouts: value with variable delays?
– detection
• a global wait-for graph can be constructed from the local ones
• a cycle in the global graph is possible without a cycle in any local graph
November 2005
Distributed systems: shared data
161
Distributed transactions
Deadlocks
(figure: a distributed deadlock among transactions U, V and W, which hold and wait
for data items A at server X, B at server Y, and C, D at server Z; no single server
sees the whole cycle)
November 2005
Distributed systems: shared data
162
Distributed transactions
Deadlocks
• Algorithms:
– centralised deadlock detection: not a good idea
• depends on a single server
• cost of transmission of local wait-for graphs
– distributed algorithm:
• complex
• phantom deadlocks
→ edge chasing approach
November 2005
Distributed systems: shared data
163
Distributed transactions
Deadlocks
• Phantom deadlocks
– deadlock detected that is not really a deadlock
– during deadlock detection
• while constructing global wait-for graph
• waiting transaction is aborted
November 2005
Distributed systems: shared data
164
Distributed transactions
Deadlocks
• Edge Chasing
– distributed approach to deadlock detection:
• no global wait-for graph is constructed
– servers attempt to find cycles
• by forwarding probes (= messages) that follow
edges of the wait-for graph throughout the
distributed system
November 2005
Distributed systems: shared data
165
Distributed transactions
Deadlocks
• Edge Chasing
– three steps:
• initiation: transaction starts waiting
– new probe constructed
• detection: probe received
– extend probe
– check for loop
– forward new probe
• resolution
November 2005
Distributed systems: shared data
166
Distributed transactions
Deadlocks
• Edge Chasing: initiation
– send out probe T → U
when transaction T starts waiting for U (and U
is already waiting for …)
– in case of lock sharing, different probes are
forwarded
November 2005
Distributed systems: shared data
167
Distributed transactions
Deadlocks
Initiation
(figure: W starts waiting for U, so the probe W → U is sent out)
November 2005
Distributed systems: shared data
168
Distributed transactions
Deadlocks
• Edge Chasing: detection
– when receiving probe
T → U
• check if U is waiting
• if U is waiting for V (and V is waiting)
add V to probe: T → U → V
• check for loop in probe
– yes → deadlock
– no → forward new probe
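A sketch of this detection step at one server (a probe is a list of transaction ids; forward and abort are assumed helpers for sending the probe on and for resolution):

    def on_probe(probe, waiting_for, forward, abort):
        # waiting_for: this server's map "blocked transaction -> transaction it waits for"
        last = probe[-1]
        if last not in waiting_for:
            return                              # the last transaction is not blocked here: drop the probe
        new_probe = probe + [waiting_for[last]]
        if new_probe[-1] in probe:
            abort(new_probe)                    # loop found (e.g. W -> U -> V -> W): pick a victim
        else:
            forward(new_probe)                  # follow the wait-for edge to the next server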
November 2005
Distributed systems: shared data
169
Distributed transactions
Deadlocks
(figure: the probe grows as it follows wait-for edges across servers X, Y and Z:
W → U, then W → U → V, then W → U → V → W, so a cycle is detected)
November 2005
Distributed systems: shared data
170
Distributed transactions
Deadlocks
• Edge Chasing: resolution
– abort one transaction
– problem?
• Every waiting transaction can initiate deadlock
detection
• detection may happen at different servers
• several transactions may be aborted
– solution: transactions priorities
November 2005
Distributed systems: shared data
171
Distributed transactions
Deadlocks
• Edge Chasing: transaction priorities
– assign priority to each transaction, e.g. using
timestamps
– solution of problem above:
• abort transaction with lowest priority
• if different servers detect same cycle, the same
transaction will be aborted
November 2005
Distributed systems: shared data
172
Distributed transactions
Deadlocks
• Edge Chasing: transaction priorities
– other improvements
• number of initiated probe messages ↓
– detection only initiated when a higher priority transaction
waits for a lower priority one
• number of forwarded probe messages ↓
– probes travel downhill: from transactions with high
priority to transactions with lower priority
– probe queues required; more complex algorithm
November 2005
Distributed systems: shared data
173
Overview
• Transactions
• Distributed transactions
– Flat and nested distributed transactions
– Atomic commit protocols
– Concurrency in distributed transactions
– Distributed deadlocks
– Transaction recovery
• Replication
November 2005
Distributed systems: shared data
174
Transactions and failures
• Introduction
– Approaches to fault-tolerant systems
• replication
– instantaneous recovery from a single fault
– expensive in computing resources
• restart and restore consistent state
– less expensive
– requires stable storage
– slow(er) recovery process
November 2005
Distributed systems: shared data
175
Transactions and failures
• Overview
– Stable storage
– Transaction recovery
– Recovery of the two-phase commit protocol
November 2005
Distributed systems: shared data
176
Transactions and failures
Stable storage
• Ensures that any essential permanent data will be
recoverable after any single system failure
• allow system failures:
– during a disk write
– damage to any single disk block
• hardware solution → RAID technology
• software solution:
– based on pairs of blocks for same data item
– checksum to determine whether block is good or bad
November 2005
Distributed systems: shared data
177
Transactions and failures
Stable storage
• Based on the following invariant:
– not more than one block of any pair is bad
– if both are good
• same data
• except during execution of write operation
• write operation:
– maintains invariant
– writes on both blocks are done strictly sequential
• restart of stable storage server after crash
→ recovery procedure to restore the invariant
November 2005
Distributed systems: shared data
178
Transactions and failures
Stable storage
• Recovery for a pair:
– both good and the same
→ ok
– one good, one bad
→ copy the good block to the bad block
– both good and different
→ copy one block to the other
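A sketch of that recovery procedure for one pair of blocks (is_good is the assumed checksum test; the first block of the pair is the one written first):

    def recover_pair(first, second, is_good):
        # first, second: dicts holding the stored data of the two copies
        if is_good(first) and is_good(second):
            if first["data"] != second["data"]:        # crash fell between the two sequential writes
                second["data"] = first["data"]         # copy one block to the other
        elif is_good(first):
            second["data"] = first["data"]             # copy the good block over the bad one
        elif is_good(second):
            first["data"] = second["data"]
        # the invariant "not more than one block of any pair is bad" excludes both being bad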
November 2005
Distributed systems: shared data
179
Transactions and failures
• Overview
– Stable storage
– Transaction recovery
– Recovery of the two-phase commit protocol
November 2005
Distributed systems: shared data
180
Transactions and failures
Transaction recovery
• atomic property of transaction implies:
– durability
• data items stored in permanent storage
• data will remain available indefinitely
– failure atomicity
• effects of transactions are atomic even when servers
fail
• recovery should ensure durability and
failure atomicity
November 2005
Distributed systems: shared data
181
Transactions and failures
Transaction recovery
• Assumptions about servers
– servers keep data in volatile storage
– committed data recorded in a recovery file
• single mechanism: recovery manager
– save data items in permanent storage for committed
transactions
– restore the server’s data items after a crash
– reorganize the recovery file to improve performance of
recovery
– reclaim storage space in the recovery file
November 2005
Distributed systems: shared data
182
Transactions and failures
Transaction recovery
• Elements of algorithm:
– each server maintains an intention list for all of its
active transactions: pairs of
• name
• new value
– decision of server: prepared to commit a transaction
→ intention list saved in the recovery file (stable storage)
– server receives DoCommit
→ commit recorded in the recovery file
– after a crash: based on recovery file
• effects of committed transactions restored (in correct order)
• effects of other transactions neglected
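A sketch of the restart step under these assumptions (the record layout of the recovery file is hypothetical):

    def restart(recovery_file, data_items):
        intentions = {}                                 # trans -> intention list of (name, new value)
        committed = []                                  # commit records, in the order they were written
        for record in recovery_file:
            if record.kind == "prepared":
                intentions[record.trans] = record.intention_list
            elif record.kind == "committed":
                committed.append(record.trans)
        for trans in committed:                         # redo committed transactions, in the correct order
            for name, value in intentions.get(trans, []):
                data_items[name] = value
        # transactions without a commit record are simply ignored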
November 2005
Distributed systems: shared data
183
Transactions and failures
Transaction recovery
• Alternative implementations for recovery file:
– logging technique
– shadow versions
• (see book for details)
November 2005
Distributed systems: shared data
184
Transactions and failures
• Overview
– Stable storage
– Transaction recovery
– Recovery of the two-phase commit protocol
November 2005
Distributed systems: shared data
185
Transactions and failures
two-phase commit protocol
• Server can fail during commit protocol
• each server keeps its own recovery file
• 2 new status values:
– done
– uncertain
November 2005
Distributed systems: shared data
186
Transactions and failures
two-phase commit protocol
• meaning of status values:
– committed:
• coordinator: outcome of votes is yes
• worker: protocol is complete
– done
• coordinator: protocol is complete
– uncertain:
• worker: voted yes; outcome unknown
November 2005
Distributed systems: shared data
187
Transactions and failures
two-phase commit protocol
• Recovery actions: (status@…) in recovery file
– prepared@coordinator
• no decision before failure of server
• send AbortTransaction to all workers
– aborted@coordinator
• send AbortTransaction to all workers
– committed@coordinator
• decision to commit taken before crash
• send DoCommit to all workers
• resume protocol
November 2005
Distributed systems: shared data
188
Transactions and failures
two-phase commit protocol
• Recovery actions: (status@…) in recovery file
– committed@worker
• send HaveCommitted to coordinator
– uncertain@worker
• send GetDecision to coordinator to get status
– prepared@worker
• not yet voted yes
• unilateral abort possible
– done@coordinator
• no action required
November 2005
Distributed systems: shared data
189
Overview
• Transactions
• Distributed transactions
• Replication
– System model and group communication
– Fault-tolerant services
– Highly available services
– Transactions with replicated data
November 2005
Distributed systems: shared data
190
Replication
• A technique for enhancing services
– Performance enhancement
– Increased availability
– Fault tolerance
• Requirements
– Replication transparency
– Consistency
November 2005
Distributed systems: shared data
191
Overview
• Transactions
• Distributed transactions
• Replication
– System model and group communication
– Fault-tolerant services
– Highly available services
– Transactions with replicated data
November 2005
Distributed systems: shared data
192
System model and group
communication
• Architectural model
(figure: clients C send requests and receive replies via front ends FE,
which forward them to the service's replica managers RM)
November 2005
Distributed systems: shared data
193
System model and group
communication
• 5 phases in the execution of a request:
– FE issues requests to one or more RMs
– Coordination: needed to execute requests consistently
• FIFO
• Causal
• Total
– Execution: by all managers, perhaps tentatively
– Agreement
– Response
November 2005
Distributed systems: shared data
194
System model and group
communication
• Need for dynamic groups!
• Role of group membership service
– Interface for group membership changes: create/destroy groups, add process
– Implementing a failure detector: monitor group members
– Notifying members of group membership changes
– Performing group address expansion
• Handling network partitions: the group is
– reduced: primary-partition
– split: partitionable
November 2005
Distributed systems: shared data
195
System model and group
communication
(figure: group membership management handles Join, Leave and Fail events for a
process group; multicast communication performs group address expansion and
group send)
November 2005
Distributed systems: shared data
196
System model and group
communication
• View delivery
– To all members when a change in membership occurs
– ≠ receiving a view
• Event occurring in a view v(g) at process p
• Basic requirements for view delivery
– Order:
if process p delivers v(g) and then v(g’)
then no process delivers v(g’) before v(g)
– Integrity:
if p delivers v(g) then p ∈ v(g)
– Non-triviality: if q joins the group and remains reachable
then eventually q ∈ v(g) at p
November 2005
Distributed systems: shared data
197
System model and group
communication
• View-synchronous group communication
– Reliable multicast + handle changing group views
– Guarantees
• Agreement: correct processes deliver the same set of messages
in any given view
• Integrity:
if a process delivers m, it will not deliver it again
• Validity: if the system fails to deliver m to q
then other processes will deliver v’(g) (=v(g) –{q})
before delivering m
November 2005
Distributed systems: shared data
198
System model and group communication
(figure: four delivery scenarios when p crashes after multicasting a message to
the group with view (p, q, r): (a) and (b) are allowed, (c) and (d) are disallowed
under view-synchronous group communication; in each case the surviving processes
q and r eventually deliver the new view (q, r))
November 2005
Distributed systems: shared data
199
Overview
• Transactions
• Distributed transactions
• Replication
– System model and group communication
– Fault-tolerant services
– Highly available services
– Transactions with replicated data
November 2005
Distributed systems: shared data
200
Fault-tolerant services
• Goal:
provide a service that is correct
despite up to f process failures
• Assumptions:
– Communication reliable
– No network partitions
• Meaning of correct in case of replication
– Service keeps responding
– Clients cannot discover difference with ...
November 2005
Distributed systems: shared data
201
Fault-tolerant services
• Naive replication system:
– Clients read and update accounts at the local replica manager
– Clients try another replica manager in case of failure
– Replica managers propagate updates in the background
• Example:
– 2 replica managers: A and B
– 2 bank accounts: x and y
– 2 clients: client 1 will use B by preference,
client 2 will use A by preference
Client 1: setBalanceB(x,1)
Client 2: setBalanceA(y,2)
          getBalanceA(y) → 2
          getBalanceA(x) → 0
Strange behaviour: Client 2 sees 0 on account x and NOT 1, although it sees 2
on account y and the update of x has been done earlier!!
November 2005
Distributed systems: shared data
202
Fault-tolerant services
• Correct behaviour?
– Linearizability
• Strong requirement
– Sequential consistency
• Weaker requirement
November 2005
Distributed systems: shared data
203
Fault-tolerant services
• Linearizability
– Terminology:
• Oij: client i performs operation j
• Sequence of operations by one client: O20, O21, O22,...
• Virtual interleaving of operations performed by all clients
– Correctness requirements: ∃ interleaved sequence ...
• Interleaved sequence of operations meets specification of a
(single) copy of the objects
• Order of operations in the interleaving is consistent with the
real times at which the operations occurred
– Real time?
• Yes, we prefer up-to-date information
• Requires clock synchronization: difficult
November 2005
Distributed systems: shared data
204
Fault-tolerant services
• Sequential consistency
– Correctness requirements: ∃ interleaved sequence ...
(the difference with linearizability is in the second requirement!)
• Interleaved sequence of operations meets specification of a
(single) copy of the objects
• Order of operations in the interleaving is consistent with the
program order in which each individual client executed them
– Example: sequentially consistent but not linearizable
Client 1: setBalanceB(x,1); then setBalanceA(y,2)
Client 2: getBalanceA(y) → 0; then getBalanceA(x) → 0
November 2005
Distributed systems: shared data
205
Fault-tolerant services
• Passive (primary-backup) replication
(figure: clients C talk to front ends FE; all requests go to the primary RM, which
propagates updates to the backup RMs)
November 2005
Distributed systems: shared data
206
Fault-tolerant services
• Passive (primary-backup) replication
– Sequence of events for handling a client request:
• Request:
FE issues request with unique id to primary
• Coordination: request handled atomically in order; if
request already handled, re-send response
• Execution:
execute request and store response
• Agreement: primary sends updated state to backups and
waits for acks
• Response:
primary responds to FE; FE hands response
back to client
– Correctness: linearizability
– Failures?
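A compressed sketch of the primary's side of these five phases (assumed names; view-synchronous communication and the acknowledgements from the backups are omitted):

    class Primary:
        def __init__(self, backups):
            self.state = {}
            self.backups = backups
            self.responses = {}                         # request id -> stored response

        def handle(self, req_id, operation):
            # Coordination: if this request was already handled, re-send the stored response
            if req_id in self.responses:
                return self.responses[req_id]
            # Execution: operation maps the old state to (response, new state)
            result, self.state = operation(self.state)
            # Agreement: send the updated state to every backup (and wait for acks)
            for b in self.backups:
                b.update(self.state)
            self.responses[req_id] = result
            # Response: handed to the FE, which passes it back to the client
            return result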
November 2005
Distributed systems: shared data
207
Fault-tolerant services
• Passive (primary-backup) replication
– Failures?
• Primary uses view-synchronous group communication
• Linearizability preserved, if
– Primary replaced by a unique backup
– Surviving replica managers agree on which operations had been
performed at the replacement point
– Evaluation:
• Non-deterministic behaviour of primary supported
• Large overhead: view-synchronous communication required
• Variation of the model:
– Read requests handled by backups: linearizability → sequential consistency
November 2005
Distributed systems: shared data
208
Fault-tolerant services
• Active replication
(figure: clients C talk to front ends FE, which multicast each request to all
replica managers RM)
November 2005
Distributed systems: shared data
209
Fault-tolerant services
• Active replication
– Sequence of events for handling a client request:
• Request: FE does a reliable, totally ordered multicast TO-multicast(g, <m, i>)
and waits for the reply
• Coordination: every correct RM receives the requests in the same order
• Execution: every correct RM executes the request;
all RMs execute all requests in the same order
• Agreement: not needed
• Response: every RM returns its result to the FE;
when is the result returned to the client?
– Crash failures: after the first response from an RM
– Byzantine failures: after f+1 identical responses from RMs
– Correctness: sequential consistency, not linearizability
November 2005
Distributed systems: shared data
210
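The response rule can be illustrated with a small sketch (assumed helper, not the book's code) of how a front end might decide when to hand a result back to the client, given the replies collected from the RMs so far.

from collections import Counter

def result_for_client(replies, f, byzantine):
    """Decide, from the RM replies gathered so far, whether a result can be
    returned to the client (sketch; replies is a list of values)."""
    if not byzantine:
        # Crash failures only: the first reply is already correct
        return replies[0] if replies else None
    # Byzantine failures: wait until f+1 RMs returned the same value
    value, count = Counter(replies).most_common(1)[0] if replies else (None, 0)
    return value if count >= f + 1 else None

# usage sketch: f = 1 Byzantine RM tolerated
print(result_for_client([5, 7, 5], f=1, byzantine=True))   # -> 5 (two identical replies)
print(result_for_client([5, 7],    f=1, byzantine=True))   # -> None (keep waiting)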
Fault-tolerant services
• Active replication
– Evaluation
• Reliable + totally ordered multicast ⇒ solving consensus
⇒ synchronous system, or asynchronous system + failure detectors
– Overhead!
• More performance
– Relax total order in case operations commute:
result of o1;o2 = result o2;o1
– Forward read-only request to a single RM
November 2005
Distributed systems: shared data
211
Overview
• Transactions
• Distributed transactions
• Replication
– System model and group communication
– Fault-tolerant services
– Highly available services
– Transactions with replicated data
November 2005
Distributed systems: shared data
212
Highly available services
• Goal
– Provide acceptable level of service
– Use minimal number of RMs
– Minimize delay for returning result
⇒ Weaker consistency
• Overview
– Coda
– Gossip Architecture
– Bayou
November 2005
Distributed systems: shared data
213
Highly available services
Coda
• Aims: constant data availability
– better performance, e.g. for bulletin boards,
databases,…
– more fault tolerance with increasing scale
– support mobile and portable computers
(disconnected operation)
Approach: AFS + replication
November 2005
Distributed systems: shared data
214
Highly available services
Coda
• Design AFS+
– file volumes replicated on different servers
– volume storage group (VSG) per file volume
– Available Volume Storage Group (AVSG) per
file volume at a particular instant in time
– volume disconnected when AVSG is empty;
due to
• network failure, partitioning
• server failures
• deliberate disconnection of portable workstation
November 2005
Distributed systems: shared data
215
Highly available services
Coda
• Replication and consistency
– file version
• integer number associated with file copy
• incremented when file is changed
– Coda version vector (CVV)
• array of numbers stored with file copy on a
particular server (holding a volume)
• one value per volume in VSG
November 2005
Distributed systems: shared data
216
Highly available services
Coda
• Replication and consistency: example 1
– File F stored at 3 servers: S1, S2, S3
– Initial values for all CVVs: CVVi = [1,1,1]
– update by C1 at S1 and S2; S3 inaccessible
CVV1 = [2,2,1], CVV2 = [2,2,1], CVV3 = [1,1,1]
– network repaired ⇒ difference detected (no conflict: CVV3 is dominated)
file copy at S3 updated
CVV1 = [2,2,2], CVV2 = [2,2,2], CVV3 = [2,2,2]
November 2005
Distributed systems: shared data
217
Highly available services
Coda
• Replication and consistency: example 2
– File F stored at 3 servers: S1, S2, S3
– Initial values for all CVVs: CVVi = [1,1,1]
– update by C1 at S1 and S2; S3 inaccessible
CVV1 = [2,2,1], CVV2 = [2,2,1], CVV3 = [1,1,1]
– update by C2 at S3 ; S1 and S2 inaccessible
CVV1 = [2,2,1], CVV2 = [2,2,1], CVV3 = [1,1,2]
– network repaired ⇒ conflict detected (neither CVV dominates)
manual intervention or ….
November 2005
Distributed systems: shared data
218
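Both examples can be replayed with a short sketch of version-vector comparison (illustrative code, not Coda's implementation): a copy is merely out of date when its CVV is dominated, and a conflict is flagged when neither CVV dominates the other.

def dominates(v1, v2):
    """True if version vector v1 >= v2 in every component."""
    return all(a >= b for a, b in zip(v1, v2))

def compare_cvv(cvv_a, cvv_b):
    if dominates(cvv_a, cvv_b):
        return "A is up to date (or equal); B can be brought up to date"
    if dominates(cvv_b, cvv_a):
        return "B is up to date; A can be brought up to date"
    return "conflict: manual (or application-specific) resolution needed"

# Example 1: S1/S2 updated while S3 was unreachable -> no conflict
print(compare_cvv([2, 2, 1], [1, 1, 1]))
# Example 2: independent updates on both sides of the partition -> conflict
print(compare_cvv([2, 2, 1], [1, 1, 2]))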
Highly available services
Coda
• Implementation
– On open
• select one server from the AVSG
• check CVV with all servers in the AVSG
• files in a replicated volume remain accessible to a client that can
access at least one of the replicas
• load sharing over replicated volumes
– On close
• multicast file to the AVSG
• update of CVV
– manual resolution of conflicts might be necessary
November 2005
Distributed systems: shared data
219
Highly available services
Coda
• Caching: update semantics
– successful open:
AVSG not empty and latest(F, AVSG, 0)
or
AVSG not empty and latest(F, AVSG, T) and
lostcallback(AVSG, T) and incache (F)
or
AVSG empty and incache (F)
November 2005
Distributed systems: shared data
220
Highly available services
Coda
• Caching: update semantics
– failed open:
AVSG not empty and conflict(F, AVSG)
or
AVSG empty and not incache(F)
– successful close:
AVSG not empty and updated(F, AVSG)
or
AVSG empty
– failed close:
AVSG not empty and conflict(F, AVSG)
November 2005
Distributed systems: shared data
221
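Read as boolean conditions, the open/close semantics above can be written down almost literally; the sketch below assumes hypothetical helper predicates latest, lostcallback, conflict, updated and incache with the informal meaning used on these slides.

# Sketch only: the helpers are placeholders for Venus-side checks, not real Coda APIs.
def successful_open(avsg, F, latest, lostcallback, incache, T):
    return ((avsg and latest(F, avsg, 0)) or
            (avsg and latest(F, avsg, T) and lostcallback(avsg, T) and incache(F)) or
            (not avsg and incache(F)))

def failed_open(avsg, F, conflict, incache):
    return (avsg and conflict(F, avsg)) or (not avsg and not incache(F))

def successful_close(avsg, F, updated):
    return (avsg and updated(F, avsg)) or not avsg

def failed_close(avsg, F, conflict):
    return avsg and conflict(F, avsg)

# usage sketch with trivial stand-in predicates
print(successful_open([], "F", lambda *a: False, lambda *a: False, lambda f: True, 10))  # True: AVSG empty, F cached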
Highly available services
Coda
• Caching: cache coherence
– relevant events to detect by Venus within T seconds of
their occurrence:
• enlargement of AVSG
• shrinking of AVSG
• lost callback event
– method: probe message to all servers in VSG of any
cached file every T seconds
November 2005
Distributed systems: shared data
222
Highly available services
Coda
• Caching: disconnected operation
– Cache replacement policy: e.g. least-recently used
– how to support long disconnections of portables:
• Venus can monitor file referencing
• users can specify a prioritised list of files to retain on local
disk
– reintegration after disconnection
• priority for files on server
• client files in conflict are stored on covolumes; client is
informed
November 2005
Distributed systems: shared data
223
Highly available services
Coda
• Performance: Coda vs. AFS
– No replication: no significant difference
– 3-fold replication, load generated by 5 users:
load +5%
– 3-fold replication, load generated by 50 users:
load +70% for Coda vs. +16% for AFS
• Difference: replication + tuning?
• Discussion
– Optimistic approach to achieve high availability
– Use of semantics-free conflict detection
(except for file directories)
November 2005
Distributed systems: shared data
224
Highly available services
Gossip
• Goal of Gossip architecture
– Framework for implementing highly available services
– Replicate data close to points where groups of clients need it
• Operations:
– 2 types:
• Queries: read-only operations
• Updates: change state (do not read state)
– FEs send operations to any RM
selection criterion: available + reasonable response time
– Guarantees
November 2005
Distributed systems: shared data
225
Highly available services
Gossip
• Goal of Gossip architecture
• Operations:
– 2 types: queries & updates
– FE send operations to any RM
– Guarantees:
• Each client obtains consistent service over time
– even when communicating with different RMs
• Relaxed consistency between replicas
– Weaker than sequential consistency
November 2005
Distributed systems: shared data
226
Highly available services
Gossip
• Update ordering:
– Causal (least costly)
– Forced (= total + causal)
– Immediate
• Applied in a consistent order relative to any other update at all
RMs, independent of order requested for other updates
• Choice
– Left to application designer
– Reflects trade-off between consistency and operation cost
– Implications for users
November 2005
Distributed systems: shared data
227
Highly available services
Gossip
• Update ordering:
– Causal (least costly)
– Forced (= total + causal)
– Immediate
• Applied in a consistent order relative to any other update at all
RMs, independent of order requested for other updates
• Example
electronic bulletin board:
– Causal: for posting items
– Forced: for adding new subscriber
– Immediate: for removing a user
November 2005
Distributed systems: shared data
228
Highly available services
Gossip
• Architecture
– Clients + FE/client
– Timestamps added to operations: in next figure
• Prev: reflects version of latest data values seen by client
• New: reflects state of responding RM
– Gossip messages:
• exchange of operations between RMs
November 2005
Distributed systems: shared data
229
Highly available services
Gossip
[Figure: the gossip service: RMs exchange gossip messages; each client has an FE; a query is sent as (Query, prev) and answered with (Val, new); an update is sent as (Update, prev) and acknowledged with an Update id]
November 2005
Distributed systems: shared data
230
Highly available services
Gossip
• Sequence of events for handling a client request:
– Request:
• FE sends request to a single RM
• For query operation: Client blocked
• For update operation: FE returns to client asap; then forwards
operation to one RM or f+1 RMs for increased reliability
– Update response:
• if update operation, RM replies to FE after receiving the request
– Coordination:
• Request stored in log queue till it can be executed
• Gossip messages can be exchanged to update state in RM
– Execution:
November 2005
Distributed systems: shared data
231
Highly available services
Gossip
• Sequence of events for handling a client request:
– Request
– Update response
– Coordination
– Execution:
• RM executes request
– Query response
• If operation is query then RM replies at this point
– Agreement:
• Lazy update by RMs
November 2005
Distributed systems: shared data
232
Highly available services
Gossip
• Gossip internals
– Timestamps at FEs
– State of RMs
– Handling of Query operations
– Processing of update operations in causal order
– Forced and immediate update operations
– Gossip messages
– Update propagation
November 2005
Distributed systems: shared data
233
Highly available services
Gossip
• Timestamps at FEs
– Vector timestamps with an entry for every RM
– Local component updated at every operation
– Returned timestamp merged with local one
– Client-to-client operations
• Via FEs
• Include timestamps (to preserve causal order)
November 2005
Distributed systems: shared data
234
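A minimal sketch (assumed representation: one integer per RM) of the front-end bookkeeping just described: the FE keeps a vector timestamp, sends it as prev with every operation, and merges the timestamp returned by the RM into its own.

def merge(ts_a, ts_b):
    """Component-wise maximum of two vector timestamps."""
    return [max(a, b) for a, b in zip(ts_a, ts_b)]

class FrontEnd:
    def __init__(self, n_rms):
        self.prev = [0] * n_rms          # latest data version this client has seen

    def on_reply(self, new_ts):
        # merge the timestamp returned by the RM into the FE's own timestamp
        self.prev = merge(self.prev, new_ts)

# usage sketch: reply from an RM that has seen update 3 of RM 0 and update 1 of RM 2
fe = FrontEnd(3)
fe.on_reply([3, 0, 1])
print(fe.prev)    # [3, 0, 1]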
Highly available services
Gossip
[Figure: the same gossip architecture, now showing the timestamps: the FE merges the timestamp (ts / new) returned by the RM into its own prev timestamp]
November 2005
Distributed systems: shared data
235
Highly available services
Gossip
• State of RMs
– Value: state as maintained by RM
– Value timestamp: associated with value
– Update log: why log?
• Operation cannot be executed yet
• Operation has to be forwarded to other RMs
– Replica timestamp: reflects updates received by RM
– Executed operation table: to prevent re-execution
– Timestamp table:
• timestamps from other RMs
• Received with Gossip messages
• Used to check for consistency between RMs
November 2005
Distributed systems: shared data
236
Highly available services
Gossip
[Figure: internal state of a gossip replica manager: the value with its value timestamp, the update log (stable and pending updates) with the replica timestamp, the executed operation table, and the timestamp table holding replica timestamps of other RMs; updates arrive from FEs as (OperationID, Update, Prev) and from other replica managers via gossip messages]
November 2005
Distributed systems: shared data
237
Highly available services
Gossip
• Handling of Query operations
– Query request contains:
• q= operation
• q.prev = timestamp at FE
– If q.prev <= valueTS
then
operation can be executed
else
operation put in hold-back queue
– Result contains new timestamp
merged with timestamp at FE
November 2005
Distributed systems: shared data
238
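A sketch of this query path (illustrative names; a real gossip RM keeps more state): the RM answers only once its value timestamp has caught up with q.prev, otherwise the query waits in a hold-back queue.

def leq(ts_a, ts_b):
    """Vector timestamp comparison: ts_a <= ts_b componentwise."""
    return all(a <= b for a, b in zip(ts_a, ts_b))

class GossipRMQueries:
    def __init__(self, n_rms):
        self.value = {}
        self.value_ts = [0] * n_rms
        self.hold_back = []              # queries that cannot be answered yet

    def query(self, q_op, q_prev):
        if leq(q_prev, self.value_ts):
            # RM state is at least as recent as what the client has seen
            return q_op(self.value), list(self.value_ts)   # result + new timestamp
        self.hold_back.append((q_op, q_prev))              # wait for missing updates
        return None

# usage sketch
rm = GossipRMQueries(3)
print(rm.query(lambda v: v.get("x", 0), [0, 0, 0]))   # answered immediately
print(rm.query(lambda v: v.get("x", 0), [1, 0, 0]))   # held back: update 1 of RM 0 not yet seen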
Highly available services
Gossip
• Processing of update operations in causal order
– Update request u
• u.op specification of operation (type & parameters)
• u.prev: timestamp generated at FE
• u.id: unique identifier
– Handling of u at RMi
November 2005
Distributed systems: shared data
239
Highly available services
Gossip
• Processing of update operations in causal order
– Update request u = <u.op, u.prev, u.id>
– Handling of u at RMi
• u already processed?
• Increment the i-th element of replicaTS
• Assign ts to u
ts[i] = replicaTS[i]
ts[k] = u.prev[k], k<>i
• Log record r = <i, ts, u.op, u.prev, u.id> added to log
• ts returned to FE
• If stability condition u.prev <= valueTS is satisfied
then value := apply(value, r.u.op)
valueTS := merge( valueTS, r.ts)
executed := executed ∪ {r.u.id}
November 2005
Distributed systems: shared data
240
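These steps translate almost line by line into code. The sketch below (assumed data structures, simplified: no log pruning and no re-check of held-back updates) shows RMi assigning the timestamp ts and applying the update once the stability condition holds.

def leq(a, b):
    return all(x <= y for x, y in zip(a, b))

def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

class GossipRM:
    def __init__(self, i, n_rms):
        self.i = i
        self.value = {}
        self.value_ts = [0] * n_rms      # updates reflected in value
        self.replica_ts = [0] * n_rms    # updates accepted into the log
        self.log = []                    # records <i, ts, op, prev, id>
        self.executed = set()            # ids of executed updates

    def update(self, u_op, u_prev, u_id):
        if u_id in self.executed:                    # already processed?
            return None
        self.replica_ts[self.i] += 1                 # increment the i-th element
        ts = list(u_prev)
        ts[self.i] = self.replica_ts[self.i]         # ts[i] = replicaTS[i], ts[k] = u.prev[k]
        self.log.append((self.i, ts, u_op, u_prev, u_id))
        if leq(u_prev, self.value_ts):               # stability condition
            self.value = u_op(self.value)            # value := apply(value, u.op)
            self.value_ts = merge(self.value_ts, ts)
            self.executed.add(u_id)
        return ts                                    # returned to the FE

# usage sketch
rm0 = GossipRM(0, 2)
print(rm0.update(lambda v: {**v, "x": 1}, [0, 0], "u-1"))   # -> [1, 0], applied immediately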
Highly available services
Gossip
• Forced and immediate update operations
– Special treatment
– Forced = total + causal order
• Unique global sequence numbers
by primary RM (reelection if necessary)
– Immediate
• Placed in sequence by primary RM (forced order)
• Additional communication between RMs
November 2005
Distributed systems: shared data
241
Highly available services
Gossip
• Gossip messages
– Gossip message m
• m.log = log
• m.ts = replicaTS
– Tasks done by RM when receiving m
• Merge m with its own log
– Drop r when r.ts <= replicaTS
– replicaTS := merge(replicaTS, m.ts)
• Apply updates that have become stable
• Eliminate records from log and executed operations table
November 2005
Distributed systems: shared data
242
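A sketch of what receiving a gossip message m could look like (illustrative state layout mirroring the previous sketch; log and executed-table pruning via the timestamp table is omitted).

def leq(a, b):
    return all(x <= y for x, y in zip(a, b))

def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

def receive_gossip(rm, m_log, m_ts):
    """rm is a dict with 'log', 'replica_ts', 'value', 'value_ts', 'executed'
    (illustrative layout, mirroring the update-processing sketch)."""
    # 1. Merge the incoming log with the local log; drop records already covered
    known_ids = {rec[4] for rec in rm["log"]}
    for rec in m_log:
        _, ts, op, prev, uid = rec
        if not leq(ts, rm["replica_ts"]) and uid not in known_ids:
            rm["log"].append(rec)
    rm["replica_ts"] = merge(rm["replica_ts"], m_ts)
    # 2. Repeatedly apply updates whose stability condition now holds
    applied = True
    while applied:
        applied = False
        for _, ts, op, prev, uid in list(rm["log"]):
            if uid not in rm["executed"] and leq(prev, rm["value_ts"]):
                rm["value"] = op(rm["value"])
                rm["value_ts"] = merge(rm["value_ts"], ts)
                rm["executed"].add(uid)
                applied = True

# usage sketch: a fresh RM receives one update via gossip
rm = {"log": [], "replica_ts": [0, 0], "value": {}, "value_ts": [0, 0], "executed": set()}
receive_gossip(rm, [(1, [0, 1], lambda v: {**v, "y": 2}, [0, 0], "u-9")], [0, 1])
print(rm["value"], rm["value_ts"])   # {'y': 2} [0, 1]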
Highly available services
Gossip
• Update propagation
– Gossip exchange frequency
• To be tuned by application
– Partner selection policy
• Random
• Deterministic
• Topological
– How much time will it take for all RMs to receive an
update
• Frequency and duration of network partitions
• Frequency for exchanging gossip-messages
• Policy for choosing partners
November 2005
Distributed systems: shared data
243
Highly available services
Gossip
• Discussion of architecture
+ Clients can continue to obtain a service even with
network partition
- Relaxed consistency guarantees
- Inappropriate for updating replicas in near-real time
? Scalability? Depends upon
• Number of updates in a gossip message
• Use read-only replicas
November 2005
Distributed systems: shared data
244
Highly available services
Bayou
• Goal
– Data replication for high availability
– Weaker guarantees than sequential consistency
– Cope with variable connectivity
• Guarantees: Every RM eventually
– receives the same set of updates
– applies those updates
November 2005
Distributed systems: shared data
245
Highly available services
Bayou
• Approach
– Use a domain specific policy for detecting and
resolving conflicts
– Every Bayou update contains
• Dependency check procedure
– Check for conflict if update would be applied
• Merge procedure
– Adapt update operation
» Achieves something similar
» Passes dependency check
November 2005
Distributed systems: shared data
246
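As an illustration only (the scenario and names are assumptions, not taken from the slides), a Bayou-style update with a dependency check and a merge procedure might look as follows for booking a slot in a shared calendar.

# Illustrative Bayou-style update (assumed scenario: booking a slot in a shared calendar).
def make_booking_update(slot, alternative_slot, who):
    def dependency_check(db):
        # conflict if the preferred slot is already taken when the update is applied
        return slot not in db

    def primary_op(db):
        db[slot] = who

    def merge_procedure(db):
        # adapt the operation: fall back to the alternative slot if it is still free
        if alternative_slot not in db:
            db[alternative_slot] = who
        # otherwise the booking is simply dropped (application-specific choice)

    return dependency_check, primary_op, merge_procedure

def apply_update(db, update):
    dependency_check, primary_op, merge_procedure = update
    if dependency_check(db):
        primary_op(db)          # no conflict: apply as requested
    else:
        merge_procedure(db)     # conflict: apply the adapted operation

# usage sketch: two disconnected users booked the same slot
db = {}
apply_update(db, make_booking_update("mon-10h", "mon-11h", "alice"))
apply_update(db, make_booking_update("mon-10h", "mon-14h", "bob"))
print(db)   # {'mon-10h': 'alice', 'mon-14h': 'bob'}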
Highly available services
Bayou
• Committed and tentative updates
– New updates are applied and marked as tentative
– Tentative updates can be undone and reapplied later
– final order decided by the primary replica manager
⇒ committed order
– Tentative update ti becomes next committed update
• Undo of all tentative updates after last committed update
• Apply ti
• Other tentative updates are reapplied
November 2005
Distributed systems: shared data
247
Highly available services
Bayou
[Figure: the committed sequence c0 c1 c2 ... cN followed by the tentative sequence t0 t1 t2 ... ti ti+1 ...]
Tentative update ti becomes the next committed update
and is inserted after the last committed update cN.
November 2005
Distributed systems: shared data
248
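The reordering when a tentative update commits can be sketched as an undo/redo over the tentative suffix (illustrative code; a real implementation would log undo information per update).

def commit_tentative(committed, tentative, i, apply, rollback, state):
    """Move tentative[i] to the end of the committed sequence (sketch).
    apply(state, u) and rollback(state, u) are assumed inverse operations."""
    # Undo all tentative updates (last applied first)
    for u in reversed(tentative):
        rollback(state, u)
    # The chosen update becomes the next committed update
    u = tentative.pop(i)
    committed.append(u)
    apply(state, u)
    # Reapply the remaining tentative updates, in their tentative order
    for v in tentative:
        apply(state, v)

# usage sketch with a toy "append to a list" state
state = []
apply_op = lambda s, u: s.append(u)
rollback_op = lambda s, u: s.remove(u)
committed, tentative = [], ["t0", "t1", "t2"]
for c in ["c0", "c1"]:
    committed.append(c); apply_op(state, c)
for t in tentative:
    apply_op(state, t)
commit_tentative(committed, tentative, 1, apply_op, rollback_op, state)
print(state)        # ['c0', 'c1', 't1', 't0', 't2']
print(committed)    # ['c0', 'c1', 't1']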
Highly available services
Bayou
• Discussion
– Makes replication non-transparent to the application
• Exploits application’s semantics to increase availability
• Maintains replicated state as eventually sequentially consistent
– Disadvantages:
• Increased complexity for application programmer
• Increased complexity for user: returned results can be changed
– Suitable for applications ....
• conflicts rare
• Underlying data semantics simple
• Users can cope with tentative information
e.g. diary
November 2005
Distributed systems: shared data
249
Overview
• Transactions
• Distributed transactions
• Replication
– System model and group communication
– Fault-tolerant services
– Highly available services
– Transactions with replicated data
November 2005
Distributed systems: shared data
250
Transactions with replicated data
• Introduction
– Replicated transactional service
• each data item replicated at a group of servers (= replica managers)
• transparency ⇒ one-copy serializability
– Why?
• increase availability
• increase performance
– Advantages/disadvantages
• higher performance for read-only requests
• degraded performance on update requests
November 2005
Distributed systems: shared data
251
Transactions with replicated data
• Architectures for replicated transactions
– questions:
• Can a client send requests to any replica manager?
• How many replica managers are needed for a
successful completion of an operation?
• If one replica manager is addressed, can this one
delay forwarding till commit of transaction?
• How to carry out two-phase commit?
November 2005
Distributed systems: shared data
252
Transactions with replicated data
• Architectures for replicated transactions
– Replication Schemes:
• Read-one/Write-all
– a read request can be performed by a single replica
manager
– a write request must be performed by all replica managers
• Quorum Consensus
• Primary Copy
November 2005
Distributed systems: shared data
253
Transactions with replicated data
• Architectures for replicated transactions
– Replication Schemes:
• Read-one/Write-all
• Quorum Consensus
– nr replica managers required to read a data item
– nw replica managers required to update a data item
– nr + nw > number of replica managers
– advantage: fewer managers needed for an update
• Primary Copy
– all requests directed to a single server
– slaves can take over when primary fails
November 2005
Distributed systems: shared data
254
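The quorum condition is easy to check mechanically; a small sketch follows (assuming one equal vote per replica manager, and also including the write-write overlap condition that appears later with the weighted-voting scheme).

def valid_quorums(n_rms, nr, nw):
    """Read/write quorums overlap iff nr + nw > n_rms (and writes overlap iff nw > n_rms / 2)."""
    return nr + nw > n_rms and 2 * nw > n_rms

# usage sketch: 5 replica managers
print(valid_quorums(5, nr=2, nw=4))   # True: every read quorum sees the latest write
print(valid_quorums(5, nr=2, nw=3))   # False: a read quorum may miss the latest write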
Transactions with replicated data
• Architectures for replicated transactions
– Forwarding update requests?
• Read-one/Write-all
– as soon as operation is received (only write requests)
• Quorum Consensus
– as soon as operation is received (read + write requests)
• Primary Copy
– when transaction commits
November 2005
Distributed systems: shared data
255
Transactions with replicated data
• Architectures for replicated transactions
– Two-phase commit
• locking (read one/write all)
– read operation: lock on a single replica
– write operation: locks on all replicas
– one-copy serializability: assured, since a read and a write of the
same item acquire conflicting locks on at least one replica
• protocol:
November 2005
Distributed systems: shared data
256
Transactions with replicated data
• Architectures for replicated transactions
– Two-phase commit
• locking (read one/write all)
• protocol:
– becomes a two-level nested two-phase commit protocol
– first phase:
» worker receives CanCommit? request
» request passed to all of its replica managers
» replies collected; a single reply sent to the coordinator
– second phase:
» same approach for the DoCommit and Abort requests
November 2005
Distributed systems: shared data
257
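A sketch of this two-level behaviour for a worker whose data item is replicated (the replica-manager interface and class names are assumptions for illustration).

# Sketch: a worker that fronts several replica managers during two-phase commit.
class ReplicatedWorker:
    def __init__(self, replica_managers):
        self.rms = replica_managers

    def can_commit(self, trans):
        # First phase: pass CanCommit? to every replica manager and
        # report a single combined vote to the coordinator.
        return all(rm.can_commit(trans) for rm in self.rms)

    def do_commit(self, trans):
        # Second phase: same fan-out for DoCommit (or DoAbort).
        for rm in self.rms:
            rm.do_commit(trans)

    def do_abort(self, trans):
        for rm in self.rms:
            rm.do_abort(trans)

# usage sketch with trivial replica managers that always vote yes
class YesRM:
    def can_commit(self, trans): return True
    def do_commit(self, trans): pass
    def do_abort(self, trans): pass

worker = ReplicatedWorker([YesRM(), YesRM()])
print(worker.can_commit("T1"))   # True -> single reply sent to the coordinator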
Transactions with replicated data
• Architectures for replicated transactions
– Failure implications?
• Read-one/Write-all
– updates impossible
⇒ available copies replication
• Quorum Consensus
– quorum satisfaction still possible
• Primary Copy
– failure of primary ⇒ election of a new primary?
– failure of slave ⇒ update delayed
November 2005
Distributed systems: shared data
258
Transactions with replicated data
• Available copies replication
– updates only performed by all available replica
managers (cfr. Coda)
– no failures:
• local concurrency control ⇒ one-copy serializability
– failures:
• additional concurrency control is needed
• example (next slides)
November 2005
Distributed systems: shared data
259
Transactions with replicated data
[Figure: clients C1 and C2 run transactions T and U; T performs getBalance(A); deposit(B,3). Replica managers X and Y hold copies of account A; M, N and P hold copies of account B.]
November 2005
Distributed systems: shared data
260
Transactions with replicated data
[Figure: the same configuration; transaction U performs getBalance(B); deposit(A,3).]
November 2005
Distributed systems: shared data
261
Transactions with replicated data
[Figure: with all replica managers available, T (getBalance(A); deposit(B,3)) and U (getBalance(B); deposit(A,3)) block on each other's locks: U is waiting for A at X, T is waiting for B at N → deadlock, detected by the normal (local) concurrency control]
November 2005
Distributed systems: shared data
262
Transactions with replicated data
[Figure: the same transactions in the presence of replica-manager failures (the copies that were read are no longer reachable when the conflicting deposits arrive): U can proceed and T can proceed; the conflict between them is not detected]
November 2005
Distributed systems: shared data
263
Transactions with replicated data
• Available copies replication
– failures & additional concurrency control
• at commit time it is checked
– servers unavailable during transaction: still unavailable?
– Servers available during transaction: still available?
– if not ⇒ abort the transaction
• implications for two-phase commit protocol?
November 2005
Distributed systems: shared data
264
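The additional check at commit time can be sketched as follows (illustrative names; the two sets would be recorded while the transaction runs).

def local_validation(failed_during_txn, contacted_during_txn, currently_available):
    """Available-copies check at commit time (sketch):
    - every RM that appeared failed during the transaction must still be unavailable
    - every RM the transaction used must still be available
    otherwise the transaction is aborted."""
    still_failed = not (failed_during_txn & currently_available)
    still_available = contacted_during_txn <= currently_available
    return still_failed and still_available   # True = may commit

# usage sketch: X was down during T, N and P were used by T
print(local_validation({"X"}, {"N", "P"}, {"N", "P", "Y"}))       # True
print(local_validation({"X"}, {"N", "P"}, {"X", "N", "P", "Y"}))  # False: X recovered -> abort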
Transactions with replicated data
• Replication and network partitions
– optimistic approach: available copies
• operations can go on in each partition
• conflicting transactions should be detected
and compensated for
• conflict detection:
– for file systems: version vector (see Coda)
– for read/write conflicts: ….
November 2005
Distributed systems: shared data
265
Transactions with replicated data
• Replication and network partitions (cont.)
– pessimistic approach: quorum consensus
• operations can go on in a single partition only
• R read quorum & W write quorum
– W > half of the votes
– R + W > total number of votes
• out of date copies should be detected
– version vectors
– timestamps
November 2005
Distributed systems: shared data
266
Transactions with replicated data
• Replication and network partitions (cont.)
– quorum consensus + available copies
• virtual partitions ⇒ advantages of both approaches
• virtual partition = abstraction of real partition
• transaction can operate in a virtual partition
– sufficient replica managers to have read & write quorum
– available copies is used in transaction
• virtual partition changes during a transaction
⇒ abort of the transaction
• member of virtual partition cannot access another member
⇒ create a new virtual partition
November 2005
Distributed systems: shared data
267
Overview
• Transactions
• Distributed transactions
• Replication
– System model and group communication
– Fault-tolerant services
– Highly available services
– Transactions with replicated data
November 2005
Distributed systems: shared data
268
Distributed Systems:
Shared Data
November 2005
Distributed systems: shared data
269