CS 347: Parallel and Distributed Data Management
Notes07: Data Replication
How often do nodes fail?
• Example: disk drives
• Schroeder and Gibson. “Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?” USENIX FAST 2007
  – Typical drive replacement rate is 2-4% annually
  – 1 PB = 1000 1-TB drives ⇒ 20-40 dead drives per year
  – ⇒ a failure every couple of weeks!
Data Replication
• Reliable net, fail-stop nodes
• The model:
[Figure: a database of items grouped into fragments, with copies replicated across node 1, node 2, node 3]
• Study one fragment, for the time being
• Data replication ⇒ higher availability
Outline
• Basic Algorithms
• Improved (Higher Availability) Algorithms
• Multiple Fragments & Other Issues
Basic Solution (for Concurrency Control)
• Treat each copy as an independent data item
[Figure: transactions Txi, Txj, Txk acquire locks from separate lock managers for X1, X2, X3]
Object X has copies X1, X2, X3
• Read(X):
– get shared X1 lock
– get shared X2 lock
– get shared X3 lock
– read one of X1, X2, X3
– at end of transaction, release X1, X2, X3 locks
[Figure: the reader acquires locks at the lock managers of X1, X2, X3 and reads one copy]
• Write(X):
– get exclusive X1 lock
– get exclusive X2 lock
– get exclusive X3 lock
– write new value into X1, X2, X3
– at end of transaction, release X1, X2, X3 locks
[Figure: the writer acquires locks at X1, X2, X3 and writes the new value into every copy]
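As a concrete (if simplified) illustration of the two procedures above, here is a small Python sketch; Copy and LockManager are hypothetical stand-ins, and a real lock manager would block conflicting requests rather than merely record the lock.

    # Sketch of the basic (RAWA) rules: lock every copy for both reads and
    # writes, read any one copy, write every copy, hold locks to end of
    # transaction (2PL). Copy/LockManager are illustrative stand-ins.
    class Copy:
        def __init__(self, value=None):
            self.value = value

    class LockManager:
        def __init__(self):
            self.held = {}                       # copy_id -> "S" or "X"
        def lock(self, copy_id, mode):
            self.held[copy_id] = mode            # real code blocks on conflicts
        def release_all(self):
            self.held.clear()

    def read_X(copies, managers):
        for cid, lm in managers.items():
            lm.lock(cid, "S")                    # shared lock at X1, X2, X3
        return next(iter(copies.values())).value # read any one copy

    def write_X(copies, managers, new_value):
        for cid, lm in managers.items():
            lm.lock(cid, "X")                    # exclusive lock at X1, X2, X3
        for c in copies.values():
            c.value = new_value                  # write every copy
        # at end of transaction: commit (e.g., via 2PC) and release all locks

    copies   = {"X1": Copy(), "X2": Copy(), "X3": Copy()}
    managers = {cid: LockManager() for cid in copies}
    write_X(copies, managers, 42)
    print(read_X(copies, managers))              # 42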
• Correctness OK
  – 2PL ⇒ serializability
  – 2PC ⇒ atomic transactions
• Problem: low availability
[Figure: X1 is down ⇒ cannot access X at all!]
Basic Solution — Improvement
• Readers lock and access a single copy
• Writers lock all copies and update all copies
[Figure: a reader holds a lock on one copy; a writer, which must lock X1, X2, X3, will conflict]
• Good availability for reads
• Poor availability for writes
Reminder
• With basic solution
– use standard 2PL
– use standard commit protocols
Variation on Basic: Primary copy
[Figure: X1* is the primary copy; the reader locks X1*; the writer locks X1* and writes X1*, X2, X3]
• Select primary site (static for now)
• Readers lock and access the primary copy
• Writers lock the primary copy and update all copies
Commit Options for Primary Site Scheme
• Local Commit
[Figure: the writer locks, writes, and commits at the primary X1*; the update is then propagated to X2 and X3]
Write(X):
• Get exclusive X1* lock
• Write new value into X1*
• ...
• Commit at primary; get sequence number
• Perform the X2, X3 updates in sequence-number order
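A minimal sketch of this local-commit write path, with hypothetical Primary and Backup classes; it simplifies each transaction to a single write, but shows the key point that a backup buffers out-of-order updates and applies them strictly in commit-sequence-number order.

    import heapq

    class Primary:
        def __init__(self):
            self.data, self.next_seq = {}, 1
        def commit_write(self, item, value):
            self.data[item] = value              # lock, write, commit locally
            seq, self.next_seq = self.next_seq, self.next_seq + 1
            return seq                           # sequence number to propagate

    class Backup:
        def __init__(self):
            self.data, self.applied_up_to = {}, 0
            self.pending = []                    # min-heap of (seq, item, value)
        def receive(self, seq, item, value):
            heapq.heappush(self.pending, (seq, item, value))
            # apply only in sequence-number order, even if messages arrive late
            while self.pending and self.pending[0][0] == self.applied_up_to + 1:
                s, it, val = heapq.heappop(self.pending)
                self.data[it] = val
                self.applied_up_to = s

    primary, backup = Primary(), Backup()
    s1 = primary.commit_write("Z", 3)            # e.g., T3 commits first: seq #1
    s2 = primary.commit_write("X", 1)            # T1 commits next: seq #2
    backup.receive(s2, "X", 1)                   # arrives out of order: buffered
    backup.receive(s1, "Z", 3)                   # now both apply, in order
    print(backup.data)                           # {'Z': 3, 'X': 1}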
Example (t = 0)
Node 1 (primary): X1 = 0*, Y1 = 0, Z1 = 0
Node 2 (backup):  X2 = 0,  Y2 = 0, Z2 = 0
T1: X ← 1; Y ← 1
T2: Y ← 2
T3: Z ← 3
Example (t = 1)
Node 1 (primary): X1 = 1*, Y1 = 1, Z1 = 3
Node 2 (backup):  X2 = 0,  Y2 = 0, Z2 = 0
T1: X ← 1; Y ← 1   (active at node 1)
T2: Y ← 2          (waiting for lock at node 1)
T3: Z ← 3          (active at node 1)
Example (t = 2)
Node 1 (primary): X1 = 1*, Y1 = 2, Z1 = 3
Node 2 (backup):  X2 = 0,  Y2 = 0, Z2 = 0
T1: X ← 1; Y ← 1   (committed; update #2: X ← 1; Y ← 1 being propagated)
T2: Y ← 2          (active at node 1)
T3: Z ← 3          (committed; update #1: Z ← 3 being propagated)
Example (t = 3)
Node 1 (primary): X1 = 1*, Y1 = 2, Z1 = 3
Node 2 (backup):  X2 = 1,  Y2 = 2, Z2 = 3
T1, T2, T3: all committed
Updates #1 (Z ← 3), #2 (X ← 1; Y ← 1), #3 (Y ← 2) are applied at the backup in sequence-number order
What good is RPWP:LC?
[Figure: the primary propagates updates to the backups; clients cannot read at a backup, since reads must go to the primary]
Answer: Can read an “out-of-date” backup copy
(also useful with 1-safe backups... later)
Commit Options for Primary Site Scheme
• Distributed Commit
[Figure: the writer locks the primary X1* and computes the new value; “prepare to write new value” is sent to X2 and X3; after both reply “ok”, the commit (write) is performed at all copies]
Example
Node 1 (primary): X1 = 0*, Y1 = 0, Z1 = 0
Node 2 (backup):  X2 = 0,  Y2 = 0, Z2 = 0
T1: X ← 1; Y ← 1
T2: Y ← 2
T3: Z ← 3
Basic Solution
• Read lock all; write lock all: RAWA
• Read lock one; write lock all: ROWA
• Read and write lock primary: RPWP
– local commit: LC
– distributed commit: DC
Comparison
N = number of nodes with copies
P = probability that a node is operational

            Probability can read    Probability can write
RAWA        P^N                     P^N
ROWA        1 - (1-P)^N             P^N
RPWP:LC     P                       P
RPWP:DC     P                       P^N
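A quick computation of these formulas; it reproduces the numbers in the tables that follow.

    # Read/write availability under each scheme, as a function of
    # N (number of copies) and P (probability a single node is up).
    def availability(N, P):
        return {
            "RAWA":    {"read": P**N,           "write": P**N},
            "ROWA":    {"read": 1 - (1 - P)**N, "write": P**N},
            "RPWP:LC": {"read": P,              "write": P},
            "RPWP:DC": {"read": P,              "write": P**N},
        }

    for N, P in [(5, 0.99), (100, 0.99), (5, 0.90)]:
        print(f"N={N}, P={P}")
        for scheme, probs in availability(N, P).items():
            print(f"  {scheme:8s} read={probs['read']:.4f} write={probs['write']:.4f}")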
Comparison (N = 5 nodes with copies, P = 0.99 probability a node is operational)

            Read Prob.    Write Prob.
RAWA        0.9510        0.9510
ROWA        1.0000        0.9510
RPWP:LC     0.9900        0.9900
RPWP:DC     0.9900        0.9510
Comparison (N = 100 nodes with copies, P = 0.99 probability a node is operational)

            Read Prob.    Write Prob.
RAWA        0.3660        0.3660
ROWA        1.0000        0.3660
RPWP:LC     0.9900        0.9900
RPWP:DC     0.9900        0.3660
Comparison (N = 5 nodes with copies, P = 0.90 probability a node is operational)

            Read Prob.    Write Prob.
RAWA        0.5905        0.5905
ROWA        1.0000        0.5905
RPWP:LC     0.9000        0.9000
RPWP:DC     0.9000        0.5905
Outline
• Basic Algorithms
• Improved (Higher Availability) Algorithms
– Mobile Primary
– Available Copies
• Multiple Fragments & Other Issues
Mobile Primary (with RPWP)
[Figure: a primary and two backups; when the primary fails, a backup takes over]
(1) Elect new primary
(2) Ensure the new primary has seen all previously committed transactions
(3) Resolve pending transactions
(4) Resume processing
(1) Elections
• Can be tricky...
• One idea:
– Nodes have IDs
– Largest ID wins
(1) Elections: One scheme
(a) Broadcast “I want to be primary, ID=X”
(b) Wait long enough so that anyone with a larger ID can stop my takeover
(c) If I see an “I want to be primary” message with a smaller ID, kill that takeover
(d) After the wait, if I have not seen a bigger ID, I am the new primary!
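A toy, single-process simulation of this largest-ID-wins rule; it ignores message loss, timeouts, and concurrent takeovers, so it only illustrates the decision step, not a full election protocol.

    # A node broadcasts its wish to be primary; any live node with a larger ID
    # kills the takeover; if nobody objects during the wait, the candidate wins.
    def run_election(candidate_id, live_node_ids):
        takeover_killed = False
        for node_id in live_node_ids:            # "broadcast" the request
            if node_id > candidate_id:           # a bigger ID stops my takeover
                takeover_killed = True           # (that node will try instead)
        return not takeover_killed               # survived the wait => primary

    live = [3, 5, 8]
    print(run_election(8, live))                 # True: 8 is the largest live ID
    print(run_election(5, live))                 # False: node 8 kills this takeover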
(1) Elections: Epoch Number
It is useful to attach an epoch (or version) number to messages:
primary: n3, epoch# = 1
primary: n5, epoch# = 2
primary: n3, epoch# = 3
...
(2) Ensure new primary has seen previously committed transactions
[Figure: the old primary committed T1, T2; the new primary needs to get and apply T1, T2; a backup remains]
⇒ How can we make sure the new primary is up to date? More on this coming up...
(3) Resolve pending transactions
[Figure: the old primary fails while T3 is pending; the new primary and the backup both hold T3 in the “W” (wait) state and must decide its fate]
Failed Nodes: Example
now:          P1 (primary) commits T1; P2 and P3 are down
later:        P1, P2, P3 are all down
later still:  P1 is still down; P2 (new primary) and P3 (backup) come back and commit T2 (unaware of T1!)
RPWP:DC & 3PC take care of problem!
• Option A
– Failed node waits for:
• commit info from active node, or
• all nodes are up and recovering
• Option B
– Majority voting
RPWP:DC & 3PC
Timeline at the primary (messages exchanged with backup 1 and backup 2):
(1) T ends work
(2) send data
(3) get acks
(4) prepare
(5) get acks
(6) commit
• may use 2PC...
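One way to picture the six steps is the sketch below; Backup2Safe is a hypothetical stand-in for a backup reachable by messages, and the primary treats the transaction as committed only after every backup has acked both the data and the prepare.

    class Backup2Safe:
        def __init__(self):
            self.staged, self.prepared, self.committed = None, False, False
        def receive_data(self, updates):
            self.staged = updates
            return "ack"
        def prepare(self):
            self.prepared = True
            return "ack"
        def commit(self):
            self.committed = True

    def two_safe_commit(updates, backups):
        # (1)-(2): transaction ends work at the primary; data is sent to backups
        if not all(b.receive_data(updates) == "ack" for b in backups):
            return "abort"                       # (3): wait for all acks
        if not all(b.prepare() == "ack" for b in backups):
            return "abort"                       # (4)-(5): prepare, wait for acks
        for b in backups:
            b.commit()                           # (6): commit everywhere
        return "commit"

    print(two_safe_commit({"X": 1}, [Backup2Safe(), Backup2Safe()]))   # 'commit'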
Node Recovery
• All transactions have a commit sequence number
• Active nodes save update values “as long as necessary”
• Recovering node asks the active primary for missed updates; applies them in order (sketch below)
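A small sketch of this recovery step, assuming active nodes keep a log of (sequence number, item, value) updates; all names are illustrative.

    def recover(backup_state, primary_log):
        # backup_state: {"applied_up_to": int, "data": dict}
        # primary_log:  updates saved by active nodes, sorted by sequence number
        for seq, item, value in primary_log:
            if seq > backup_state["applied_up_to"]:
                backup_state["data"][item] = value       # apply missed update
                backup_state["applied_up_to"] = seq      # strictly in order
        return backup_state

    log = [(1, "Z", 3), (2, "X", 1), (3, "Y", 2)]
    print(recover({"applied_up_to": 1, "data": {"Z": 3}}, log))
    # {'applied_up_to': 3, 'data': {'Z': 3, 'X': 1, 'Y': 2}}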
Example: Majority Commit
state:          C1          C2        C3
C (committed):  T1,T2,T3    T1,T2     T1,T3
P (prepared):               T3        T2
W (waiting):    T4          T4        T4
t1: C1 fails
t2: C2 becomes the new primary
t3: C2 commits T1, T2, T3; aborts T4
t4: C2 resumes processing
t5: C2 commits T5, T6
t6: C1 recovers; asks C2 for the latest state
t7: C2 sends committed and pending transactions; C2 involves C1 in any future transactions
2-safe vs. 1-safe Backups
• Up to now we have covered 2-safe backups (RPWP:DC):
Timeline at the primary (with backup 1 and backup 2):
(1) T ends work
(2) send data
(3) get acks
(4) commit
Guarantee
• After transaction T commits at the primary, any future primary will “see” T
[Figure: now, the primary and backups have committed T1, T2, T3; later, backup 1 becomes the next primary, already has T1, T2, T3, and goes on to commit T4]
Performance Hit
• 3PC is very expensive
– many messages
– locks held longer (less concurrency)
[Note: group commit may help]
• Can use 2PC
– may have blocking
– 2PC still expensive
[up to 1 second reported]
Alternative: 1-safe (RPWP:LC)
• Commit transactions unilaterally at the primary
• Send updates to backups as soon as possible
Timeline at the primary (with backup 1 and backup 2):
(1) T ends work
(2) T commits
(3) send data
(4) purge data
Problem: Lost Transactions
now:   primary has committed T1, T2, T3; backup 1 and backup 2 have only T1
later: the primary is down; backup 1 (next primary) has T1, T4, T5; backup 2 has T1, T4
       ⇒ T2 and T3 are lost
Claim
• Lost transaction problem tolerable
– failures rare
– only a “few” transactions lost
Primary Recovery with 1-safe
• When the failed primary recovers, it needs to “compensate” for missed transactions
now:   failed primary: T1, T2, T3 | next primary: T1, T4, T5 | backup 2: T1, T4
later: recovered primary (now backup 3): T1, T2, T3, T3-1, T2-1, T4, T5 (T3-1, T2-1 are compensating transactions) | next primary: T1, T4, T5 | backup 2: T1, T4, T5
Log Shipping
• “Log shipping”: propagate updates to the backup by sending the log
[Figure: primary → log → backup]
• Backup replays the log
• How to replay the log efficiently? (see the sketch below)
  – e.g., elevator disk sweeps
  – e.g., avoid overwrites
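A sketch of the "avoid overwrites" idea: within a batch of shipped log records, only the last write of each item needs to be installed at the backup. (This ignores transaction boundaries; a real replayer would also preserve atomicity.)

    def replay_batch(log_records, backup_data):
        # log_records: list of (seq, item, value) in commit order
        last_write = {}
        for seq, item, value in log_records:
            last_write[item] = value             # later writes win
        for item, value in last_write.items():
            backup_data[item] = value            # one install per item
        return backup_data

    batch = [(1, "A", 1), (2, "B", 7), (3, "A", 5)]
    print(replay_batch(batch, {}))               # {'A': 5, 'B': 7}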
So Far in Data Replication
• RAWA
• ROWA
• Primary copy
  – static
    • local commit
    • distributed commit
  – mobile primary
    • 2-safe (distributed commit), blocking or non-blocking
    • 1-safe (local commit)
Outline
• Basic Algorithms
• Improved (Higher Availability) Algorithms
– Mobile Primary
– Available Copies
• Multiple Fragments & Other Issues
PC-lock available copies
[Figure: copies X1* (primary), X2, X3, X4; one copy is down]
• Transactions write-lock all available copies
• Transactions read-lock any available copy
• The primary site (static) manages U, the set of available copies
Update Transaction
(1) Get U from the primary
(2) Get write locks from the nodes in U
(3) Commit at the nodes in U
[Figure: C0 (primary), C1 (backup), C2 (backup, unavailable); transaction T3 obtains U = {C0, C1} from the primary and sends its updates and 2PC messages to C0 and C1]
A potential problem - example
Now: U = {C0, C1}; C2 (recovering) tells the primary C0 “I am recovering”; transaction T3 is running with U = {C0, C1}
Later: the primary has set U = {C0, C1, C2} and tells C2 “You missed T0, T1, T2”; but T3 (still using U = {C0, C1}) sends its updates only to C0 and C1, so C2 misses T3’s updates
Solution:
• Initially, transaction T gets a copy U’ of U from the primary (or uses a cached value)
• At commit of T, check U’ against the current U at the primary (if different, abort T)
Solution Continued
• When a copy CX recovers:
  – request missed and pending transactions from the primary (the primary updates U)
  – set write locks for pending transactions
• The primary polls nodes to detect failures (updates U)
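A sketch of the U check described above, with a hypothetical PrimarySite class: the transaction caches U' when it starts, and the primary refuses the commit if U has changed since then.

    class PrimarySite:
        def __init__(self, available):
            self.U = set(available)              # current set of available copies
        def begin(self):
            return set(self.U)                   # transaction's cached U'
        def can_commit(self, cached_U):
            return cached_U == self.U            # if U changed, abort

    primary = PrimarySite({"C0", "C1"})
    U_prime = primary.begin()                    # T3 starts with U' = {C0, C1}
    primary.U.add("C2")                          # C2 recovers; primary updates U
    print(primary.can_commit(U_prime))           # False -> abort T3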
Example Revisited
[Figure: C2 tells the primary C0 “I am recovering”; C0 changes U from {C0, C1} to {C0, C1, C2} and replies “You missed T0, T1, T2”; when T3 (which started with U = {C0, C1}) sends its prepare messages to C0 and C1, the primary rejects it because U has changed]
Available Copies — No Primary
• Let all nodes have a copy of U (not just the primary)
• To modify U, run a special atomic transaction at all available sites (use a commit protocol)
  – E.g.: U1 = {C1, C2} → U2 = {C1, C2, C3}; only C1, C2 participate in this transaction
  – E.g.: U2 = {C1, C2, C3} → U3 = {C1, C2}; only C1, C2 participate in this transaction
• Details are tricky...
• What if commit of U-change blocks?
Node Recovery (no primary)
• Get missed updates from any active node
• No unique sequence of transactions
• If all nodes fail, wait for: all to recover, or a majority to recover
Example
[Figure: three nodes; one has Committed: A,B,C,D,E,F; the recovering node has Committed: A,B and Pending: G; another has Committed: A,C,B,E,D and Pending: F,G,H]
⇒ How much information (update values) must be remembered? By whom?
Correctness with replicated data
Object X has copies X1 and X2.
S1: r1[X1] r2[X2] w1[X1] w2[X2]
⇒ Is this schedule serializable?
One idea: Require transactions to update all copies
S1: r1[X1] r2[X2] w1[X1] w2[X2] w1[X2] w2[X1]
(not a good idea for high-availability algorithms)
Another idea: Build copy semantics into the notion of serializability
One copy serializable (1SR)
A schedule S on replicated data is 1SR if it is equivalent to a serial history of the same transactions on a one-copy database.
To check 1SR
• Take the schedule
• Treat ri[Xj] as ri[X] and wi[Xj] as wi[X]   (Xj is a copy of X)
• Compute the precedence graph P(S)
• If P(S) is acyclic, S is 1SR
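A small sketch of this test: map every operation on a copy Xj to the logical item X, build the precedence graph from conflicting operations of different transactions, and check that it is acyclic. The schedule encoding is made up for illustration.

    from itertools import combinations

    def is_1SR(schedule):
        # schedule: ordered list of (txn, op, copy), e.g. ("T1", "r", "X1")
        # Step 1: treat ri[Xj] as ri[X] and wi[Xj] as wi[X]
        ops = [(t, op, copy.rstrip("0123456789")) for t, op, copy in schedule]
        # Step 2: build the precedence graph P(S) from conflicts
        edges = set()
        for i, j in combinations(range(len(ops)), 2):
            ti, oi, xi = ops[i]
            tj, oj, xj = ops[j]
            if ti != tj and xi == xj and "w" in (oi, oj):
                edges.add((ti, tj))              # ti's op precedes tj's conflicting op
        # Step 3: acyclic? (repeatedly peel off nodes with no incoming edge)
        nodes = {t for t, _, _ in ops}
        while nodes:
            sources = {n for n in nodes
                       if not any(dst == n and src in nodes for src, dst in edges)}
            if not sources:
                return False                     # cycle => not 1SR
            nodes -= sources
        return True

    S1 = [("T1", "r", "X1"), ("T2", "r", "X2"), ("T1", "w", "X1"), ("T2", "w", "X2")]
    print(is_1SR(S1))                            # False: S1 is not 1SR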
Example
S1:  r1[X1] r2[X2] w1[X1] w2[X2]
S1’: r1[X] r2[X] w1[X] w2[X]
P(S1’) contains both T2 → T1 and T1 → T2 (a cycle)
S1 is not 1SR!
Second example
S2:  r1[X1] w1[X1] w1[X2]
     r2[X1] w2[X1] w2[X2]
S2’: r1[X] w1[X] w1[X]
     r2[X] w2[X] w2[X]
P(S2): T1 → T2
S2 is 1SR
• Equivalent serial schedule
SS: r1[X] w1[X]
    r2[X] w2[X]
Question: Is this a “good” schedule?
S3: r1[X1] w1[X1] w1[X2]
    r2[X1] w2[X1]
Mapping copies to the single item X:
S3’: r1[X] w1[X] w1[X]
     r2[X] w2[X]
To be a valid one-copy schedule, we need a precedence edge between w1[X] and w2[X]:
S3’: r1[X] w1[X]
     r2[X] w2[X]
We need to know how w2[X2] is resolved:
OK:     w2[X2] is performed after w1[X2] (X1 and X2 see T1’s and T2’s writes in the same order)
Not OK: w2[X2] is performed before w1[X2] (X2 would reflect the writes in the opposite order from X1)
Bottom line: when w2[X2] is missing because X2 is down, assume that X2 recovery will perform w2[X2] in the correct order. Then S3 is equivalent to the one-copy schedule:
S3’: r1[X] w1[X]
     r2[X] w2[X]
Another example: S3 continues with T3
S4: r1[X1] w1[X1] w1[X2] r3[X2] w3[X1]
    r2[X1] w2[X1]
    w3[X2]
Mapping copies to the single item X:
S4’: r1[X] w1[X] w1[X] r3[X] w3[X]
     r2[X] w2[X]
     w3[X]
Collapsing duplicate writes:
S4’: r1[X] w1[X] r3[X] w3[X]
     r2[X] w2[X]
Seems OK, but where do we place the missing w2[X2]?
w2[X2] must come before w1[X2] or after r3[X2] (otherwise T3 would have read a different value)
One option: w2[X2] is performed later, by X2 recovery:
S4: r1[X1] w1[X1] w1[X2] r3[X2] w3[X1]
    r2[X1] w2[X1]
    w3[X2]
    w2[X2]   (performed by X2 recovery)
Outline
• Basic Algorithms
• Improved (Higher Availability) Algorithms
– Mobile Primary
– Available Copies (and 1SR)
• Multiple Fragments & Other Issues
Multiple fragments
[Figure: Fragment 1 and Fragment 2, each replicated across some of Node 1 - Node 4]
• A transaction spanning multiple fragments must
  – follow the locking rules for each fragment
  – involve a “majority” in each fragment at commit
• Be careful with update transactions that read but do not modify a fragment
Example:
[Figure: F1 has copies C1, C2; F2 has copies C3, C4. T1 takes a read lock on F1 at C1; T2 takes a read lock on F2 at C3]
C1, C3 fail…
[Figure: T2 writes F1 at C2 and commits F1 at C2; T1 writes F2 at C4 and commits F2 at C4]
Equivalent history:
r1[F1] r2[F2] w1[F2] w2[F1]
not serializable!
Solution: commit at read sites too
C1, C3 fail… (revisited with commits at read sites)
[Figure: as before, T2 writes F1 at C2 and T1 writes F2 at C4, but now the transaction cannot commit at F1 because U = {C1} is out of date...]
Read-Only Transactions
• Can provide “weaker correctness”
• Do not impact values stored in the DB
[Figure: C1 (primary): A = 0, B = 0; C2 (backup): A = 0, B = 0]
T1: A ← 3
T2: B ← 5
Later on:
C1 (primary): A = 3, B = 5
C2 (backup):  A = 3, B = 0 (the update B ← 5 is still in transit)
• R1, a read transaction at the primary, sees the current state
• R2, a read transaction at the backup, sees an “old” but “valid” state
States Are Equivalent
• States at Primary: no transactions; T1; T1, T2; T1, T2, T3; ...
• States at Backup:  no transactions; T1; T1, T2; T1, T2, T3; ...
• At any given point in time the backup may be behind, but it passes through the same sequence of states
Schedule is Serializable
• S1 = T1 R1 T2 T3 R2 T4 ...(R1)...(R2)...
Example 2
• A and B have different primaries now
• 1-safe protocol used
[Figure: C1 is the primary for A; C2 is the primary for B; initially A = 0, B = 0 at both]
T1: A ← 3
T2: B ← 5
After T1 commits at C1 and T2 commits at C2:
C1: A = 3, B = 0 (propagating A ← 3 to C2; B ← 5 has not arrived yet)
C2: A = 0, B = 5 (propagating B ← 5 to C1; A ← 3 has not arrived yet)
At this time:
• Q1 reads A, B at C1; it sees T1 → Q1 → T2
• Q2 reads A, B at C2; it sees T2 → Q2 → T1
Eventually:
C1: A = 3, B = 5;  C2: A = 3, B = 5
• The schedule of update transactions is OK: T1 T2 (equivalently, T2 T1)
• Each read-only transaction sees an OK schedule: T1 Q1 T2 or T2 Q2 T1
• But there is NO single complete schedule that is “OK”...
• In many cases, such a scenario is OK
• Called weak serializability:
– update schedule is serializable
– R.O.T. see committed data
Data Replication
• RAWA, ROWA
• Primary copy
  – static [local commit or distributed commit]
  – mobile primary [2-safe (2PC or 3PC) or 1-safe]
• Available copies [with or without primary]
• Correctness (1SR)
• Multiple Fragments
• Read-Only Transactions
Issues
• To centralize control or not?
• How much availability?
• “Weak” reads OK?
Quorum protocols
• All decisions made by majority (“quorum”)
• Any live quorum can make progress
• Example: Paxos
  – Phase 1:
    • Leader asks for votes
    • Quorum votes for leader
  – Phase 2:
    • Leader proposes a write
    • Quorum acks the proposal
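This is not Paxos itself, but a minimal quorum read/write sketch in the same spirit: values carry version numbers, a write succeeds only if a majority of all replicas accept it, and a read asks a majority and takes the highest-versioned answer. All names are illustrative.

    class Replica:
        def __init__(self):
            self.version, self.value = 0, None
        def write(self, version, value):
            if version > self.version:           # keep only newer versions
                self.version, self.value = version, value
            return "ack"
        def read(self):
            return self.version, self.value

    def quorum_write(all_replicas, live_replicas, version, value):
        acks = sum(1 for r in live_replicas if r.write(version, value) == "ack")
        return acks > len(all_replicas) // 2     # majority of ALL replicas needed

    def quorum_read(all_replicas, live_replicas):
        answers = [r.read() for r in live_replicas]
        if len(answers) <= len(all_replicas) // 2:
            return None                          # no live quorum
        return max(answers, key=lambda va: va[0])[1]   # highest version wins

    replicas = [Replica() for _ in range(5)]
    live = replicas[:3]                          # any 3 of 5 form a live quorum
    print(quorum_write(replicas, live, version=1, value=42))   # True
    print(quorum_read(replicas, live))                         # 42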