Research Background: FATE and DESTINI
Haryadi Gunawi
UC Berkeley
[Slide: storage evolution, from local storage (a laptop) to storage servers and cloud storage, e.g. Google]
Cloud storage systems:
- Internet-service FS: GoogleFS, HadoopFS, CloudStore, ...
- Key-value store: Cassandra, Voldemort, ...
- Structured storage: Yahoo! PNUTS, Google BigTable, HBase, ...
- Custom storage: Facebook Haystack Photo Store, Microsoft StarTrack (Map Apps), Amazon S3, EBS, ...
plus replication, scale-up, migration, custom features, ...

"This is not just data. It's my life. And I would be sick if I lost it." [CNN '10]
Cloudy with a chance of failure.
Outline: Motivation, FATE, DESTINI, Evaluation.
Joint work with Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph Hellerstein, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Koushik Sen, and Dhruba Borthakur.
The cloud: thousands of commodity machines. "Rare (HW) failures become frequent" [Hamilton]. Failure recovery "... has to come from the software" [Dean] and "... must be a first-class op" [Ramakrishnan et al.].
Google Chubby (lock service):
- Four occasions of data loss
- Whole-system down
- More problems after injecting multiple failures (randomly)
Google BigTable (key-value store):
- A Chubby bug affects BigTable availability
More details? How often? Other problems and implications?
Open-source cloud projects are a very valuable bug/issue repository: HDFS (1400+), ZooKeeper (900+), Cassandra (1600+), HBase (3000+), Hadoop (7000+).

HDFS JIRA study: 1300 issues over 4 years (April 2006 to July 2010). We selected recovery problems due to hardware failures: 91 recovery bugs/issues.

Implications     Count
Data loss        13
Unavailability   48
Corruption       19
Misc.            10
Testing is not advanced enough [Google]: the failure model includes multiple, diverse failures. Recovery is under-specified [Hamilton]: there is lots of custom recovery, and the implementation is complex.
We need two advancements:
- Exercise complex failure modes
- Write specifications and test the implementation
FATE (Failure Testing Service): injects failures (X) into the cloud software.
DESTINI (Declarative Testing Specifications): checks whether the failed runs violate the specs.
Outline: Motivation; FATE (architecture, failure exploration); DESTINI; Evaluation.
HadoopFS (HDFS) write protocol:
[Diagram: master M, client C, datanodes 1-3]
With no failures, the client sends an allocation request to the master, runs the setup stage across the pipeline (nodes 1, 2, 3), then streams data in the data transfer stage.
- Setup recovery: on a failure during setup, recreate a fresh pipeline (e.g. nodes 1, 2, 4).
- Data transfer recovery: on a failure during data transfer, continue on the surviving nodes (e.g. nodes 1, 2).
(Both recovery paths are sketched below.)
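To make the two recovery paths concrete, here is a minimal Java sketch of the client-side protocol, assuming one retry per stage; all names (allocateNodes, setupStage, transferStage, NodeFailure) are hypothetical, not the actual HDFS API:

    import java.util.ArrayList;
    import java.util.List;

    class NodeFailure extends Exception {
        final String node;
        NodeFailure(String node) { this.node = node; }
    }

    class WritePipelineSketch {
        private static int next = 1;

        // Stand-in for the master's reply to the allocation request (Alloc Req).
        static List<String> allocateNodes(int n) {
            List<String> nodes = new ArrayList<>();
            for (int i = 0; i < n; i++) nodes.add("N" + next++);
            return nodes;
        }

        // Each pipeline node creates its temp files and acks; a crash throws.
        static void setupStage(List<String> nodes) throws NodeFailure { }

        // Data streams through the pipeline; a crash throws.
        static void transferStage(List<String> nodes, byte[] block) throws NodeFailure { }

        static void write(byte[] block) throws NodeFailure {
            List<String> nodes = allocateNodes(3);
            try {
                setupStage(nodes);
            } catch (NodeFailure f) {
                nodes = allocateNodes(3);     // setup recovery: recreate a FRESH pipeline
                setupStage(nodes);
            }
            try {
                transferStage(nodes, block);
            } catch (NodeFailure f) {
                nodes.remove(f.node);         // data-transfer recovery: SURVIVING nodes only
                transferStage(nodes, block);
            }
        }
    }

The point of the contrast: a setup failure discards the pipeline and asks the master again, while a data-transfer failure keeps going with whoever is left.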
Failures can happen:
- Anytime: different stages, different recovery
- Anywhere: N2 crashes, and then N3
- Any type: bad disks, partitioned nodes/racks
FATE systematically exercises multiple, diverse failures. How? It needs to "remember" failures, via failure IDs.
Abstraction of I/O failures: building failure IDs (FIDs A, B, C, ...).
- Intercept every I/O
- Inject possible failures, e.g. crash, network partition, disk failure (LSE/corruption)
Example (Failure ID: 25):
- I/O information: <stack trace>; OutputStream.read() in BlockReceiver.java; net I/O from N3 to N2; "Data Ack"
- Failure: crash after
(A sketch of such an ID follows.)
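A failure ID is a stable name for "this I/O point plus this injected failure", so FATE can remember what it has already exercised. A minimal sketch of how such an ID might be composed; the exact fields are an assumption for illustration, not the paper's actual data structure:

    import java.util.Objects;

    // Sketch: a failure ID combines static I/O information (stack trace),
    // domain-specific information (endpoints, message), and the failure type.
    final class FailureId {
        final String stackTrace;  // e.g. "OutputStream.read() in BlockReceiver.java"
        final String source;      // e.g. "N3"
        final String target;      // e.g. "N2"
        final String message;     // e.g. "Data Ack"
        final String failure;     // e.g. "Crash After"

        FailureId(String stackTrace, String source, String target,
                  String message, String failure) {
            this.stackTrace = stackTrace;
            this.source = source;
            this.target = target;
            this.message = message;
            this.failure = failure;
        }

        // A compact numeric name, like the slide's "Failure ID: 25".
        int id() {
            return Math.abs(Objects.hash(stackTrace, source, target, message, failure));
        }
    }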
Architecture: the workload driver runs

    while (new FIDs) {
        hdfs.write();
    }

against the target system (e.g. Hadoop FS), which is built on the Java SDK and instrumented with an AspectJ failure surface. The failure surface reports I/O info to the failure server, which decides: fail or no fail? (That decision is sketched below.)
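A minimal sketch of the server-side bookkeeping, assuming a one-new-failure-per-run policy (the real exploration strategies are richer, and the class and method names here are hypothetical):

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: the failure server decides fail/no-fail per intercepted I/O,
    // injecting at most one not-yet-exercised failure ID per run, so the
    // driver's "while (new FIDs) { hdfs.write(); }" loop terminates.
    class FailureServer {
        private final Set<Integer> exercised = new HashSet<>();
        private boolean injectedThisRun;
        private boolean sawNewFid;

        synchronized void startRun() { injectedThisRun = false; }

        // Called from the AspectJ failure surface with the I/O's failure ID.
        synchronized boolean shouldFail(int failureId) {
            boolean isNew = !exercised.contains(failureId);
            sawNewFid |= isNew;
            if (isNew && !injectedThisRun) {
                exercised.add(failureId);
                injectedThisRun = true;
                return true;   // fail
            }
            return false;      // no fail
        }

        // The workload driver keeps looping while new failure IDs appear.
        synchronized boolean newFids() {
            boolean v = sawNewFid;
            sawNewFid = false;
            return v;
        }
    }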
With 1 failure per run: Exp #1 injects A, Exp #2 injects B, Exp #3 injects C. With 2 failures per run, the experiments inject pairs: AB, AC, BC, and so on.
[Diagram: failure IDs A, B, C arising at the master, client, and nodes 1-3 across experiments]
Outline: Motivation; FATE (architecture, failure exploration: challenge and solution); DESTINI; Evaluation.
FATE exercised over 40,000 unique combinations of 1, 2, and 3 failures per run: 80 hours of testing time! The new challenge is combinatorial explosion; we need smarter exploration strategies.
[Diagram: failure IDs A and B at nodes 1-3; with 2 failures per run the space is A1A2, A1B2, B1A2, B1B2, ...]
The sketch below makes the blow-up concrete.
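The explosion is just enumeration: with F observable failure IDs, ordered pairs are O(F^2) and triples O(F^3). A toy brute-force sketch (hypothetical; real FATE also sequences the injections within a run):

    import java.util.ArrayList;
    import java.util.List;

    class BruteForceSpace {
        // All ordered pairs of failure IDs: one pair injected per run.
        static List<int[]> pairs(List<Integer> fids) {
            List<int[]> out = new ArrayList<>();
            for (int first : fids)
                for (int second : fids)
                    out.add(new int[] { first, second });
            return out;
        }

        public static void main(String[] args) {
            System.out.println(pairs(List.of(1, 2, 3, 4, 5)).size());  // 25
            // With hundreds of IDs per workload, pairs and triples quickly
            // reach the tens of thousands of combinations reported above.
        }
    }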
Properties of multiple failures: pairwise dependent failures and pairwise independent failures.
Goal: exercise distinct recovery behaviors.
Key: some failures result in similar recovery.
Result: > 10x faster, and it found the same bugs.
Failure dependency graph and recovery clustering: inject single failures first, and record the subsequent dependent failure IDs (e.g. X depends on A).

FID   Subsequent FIDs
A     X
B     X
C     X, Y
D     X, Y

Brute force exercises AX, BX, CX, DX, CY, DY. But there are only two clusters, {X} and {X, Y}, so only exercise distinct clusters: pick one failure ID that triggers each recovery cluster. Result: AX, CX, CY. (A clustering sketch follows.)
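A sketch of recovery clustering under the slide's assumption: two first failures belong to the same cluster when they expose the same set of subsequent failure IDs, and one representative per cluster suffices (names hypothetical):

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class RecoveryClustering {
        // subsequent: first FID -> the dependent FIDs recorded after injecting it,
        // e.g. {A:{X}, B:{X}, C:{X,Y}, D:{X,Y}} from the table above.
        static List<String> plan(Map<String, Set<String>> subsequent) {
            Map<Set<String>, String> reps = new LinkedHashMap<>();
            for (Map.Entry<String, Set<String>> e : subsequent.entrySet())
                reps.putIfAbsent(e.getValue(), e.getKey());   // one FID per cluster
            List<String> experiments = new ArrayList<>();
            for (Map.Entry<Set<String>, String> cluster : reps.entrySet())
                for (String second : cluster.getKey())
                    experiments.add(cluster.getValue() + second);
            return experiments;   // the table above yields AX, CX, CY
        }
    }

Feeding the slide's table into plan() produces exactly the pruned set AX, CX, CY instead of the six brute-force pairs.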
Independent combinations: FP^2 x N(N-1) experiments, where FP is the number of failure points and N the number of nodes. For FP = 2 and N = 3, the total is 24. For symmetric code, just pick two nodes, which halves this to FP^2 x N(N-1)/2.
[Diagram: failure points A and B across nodes 1-3, before and after symmetry pruning]
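As a worked check of those counts (FP failure points per I/O, N nodes):

\[
FP^2 \times N(N-1) = 2^2 \times 3 \cdot 2 = 24,
\qquad
FP^2 \times \frac{N(N-1)}{2} = 2^2 \times \frac{3 \cdot 2}{2} = 12 .
\]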
FP^2 is the bottleneck: with FP = 4 the total is already 16, and in a real example FP = 15. Recovery clustering helps here too: cluster failure points A and B if fail(A) == fail(B), reducing FP^2 to FP_clustered^2, e.g. from 15 FPs down to 8 clustered FPs.
[Diagram: failure points A-D across node pairs, before and after clustering]
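The same idea can be sketched for failure points: group FPs whose injected failure has the same observable effect (the slide's fail(A) == fail(B)), then build the FP^2 pairs from cluster representatives only. The signature strings below are hypothetical:

    import java.util.Collection;
    import java.util.LinkedHashMap;
    import java.util.Map;

    class FailurePointClustering {
        // failSignature: failure point -> observable effect of failing there.
        static Collection<String> representatives(Map<String, String> failSignature) {
            Map<String, String> reps = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : failSignature.entrySet())
                reps.putIfAbsent(e.getValue(), e.getKey());
            return reps.values();   // pair up only these FPs
        }

        public static void main(String[] args) {
            Map<String, String> sig = new LinkedHashMap<>();
            sig.put("A", "crash-before-send");
            sig.put("B", "crash-before-send");   // fail(B) == fail(A): same cluster
            sig.put("C", "crash-after-send");
            sig.put("D", "crash-after-send");
            System.out.println(representatives(sig));   // [A, C]
        }
    }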
FATE exercises multiple, diverse failures via the failure ID abstraction. The challenge is the combinatorial explosion of multiple failures; smart failure exploration strategies, built on top of the failure ID abstraction, give a > 10x improvement.
Current limitations:
- No I/O reordering
- Manual workload setup
Outline: Motivation; FATE; DESTINI (overview, building specifications); Evaluation.
Is the system correct under failures? To answer that, we need to write specifications. A developer's reaction: "[It is] great to document (in a spec) the HDFS write protocol ..., but we shouldn't spend too much time on it, ... a formal spec may be overkill for a protocol we plan to deprecate imminently."
[Diagram: specs checked against the implementation under injected failures]
How should specifications be written? They must be developer friendly: clear, concise, easy. Existing approaches fall short: unit tests are ugly, bloated, and not formal; other approaches are too verbose and long. We use a declarative relational logic language (Datalog). Key: it makes logical relations easy to express.
How to write specs? As violations, expectations, and facts.
How to write recovery specs? "... recovery is under-specified" [Hamilton]. They need precise failure events and precise check timings.
How to test the implementation? Interpose I/O calls (lightweight) and deduce expectations and facts from the I/O events.
Outline: Motivation; FATE; DESTINI (overview, building specifications); Evaluation.
"Throw a violation if an expectation is different from the actual behavior":

violationTable(...) :- expectationTable(...), NOT-IN actualTable(...);

Datalog syntax: head() :- predicates(), ...; the ":-" is a derivation, and "," means AND.
"Replicas should exist in surviving nodes."
[Diagram: master M, client C, nodes 1-3; node 3 crashes while block B is in data transfer]

expectedNodes(Block, Node):  B, Node1;  B, Node2
actualNodes(Block, Node):    B, Node1;  B, Node2
incorrectNodes(Block, Node): (empty: no violation)

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
If Node2's replica is actually missing, the same rule flags it:

expectedNodes(Block, Node):  B, Node1;  B, Node2
actualNodes(Block, Node):    B, Node1
incorrectNodes(Block, Node): B, Node2

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
Ex: which nodes should have the blocks? Deduce the expectations from I/O events. The client asks the master "Give me 3 nodes for B", and getBlockPipe(...) returns [Node1, Node2, Node3], so expectedNodes(Block, Node) holds B, Node1; B, Node2; B, Node3. (The event-to-fact bridge is sketched below.)

#2: expectedNodes(B, N) :- getBlockPipe(B, N);
#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
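DESTINI obtains such events by interposing the relevant interfaces and inserting each intercepted call as a Datalog fact. A minimal Java sketch of that bridge; the DatalogEngine interface and the names here are assumptions for illustration, not DESTINI's actual API:

    import java.util.List;

    // Sketch: turn an intercepted pipeline reply into getBlockPipe facts, from
    // which rule #2 derives expectedNodes(B, N) for each node in the pipeline.
    interface DatalogEngine {
        void insertFact(String relation, Object... args);
    }

    class WriteEventTap {
        private final DatalogEngine engine;

        WriteEventTap(DatalogEngine engine) { this.engine = engine; }

        // Called (e.g. from AspectJ advice) when the client receives the
        // pipeline [Node1, Node2, Node3] for block B from the master.
        void onGetBlockPipe(String block, List<String> nodes) {
            for (String node : nodes)
                engine.insertFact("getBlockPipe", block, node);
        }
    }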
When FATE crashes Node3, the crashed node must be removed from the expectation:

DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N);

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
#2: expectedNodes(B, N) :- getBlockPipe(B, N);
Different stages imply different recovery behaviors, so the rules need precise failure events. For data transfer recovery, only drop the expectation if the crash occurs during the data transfer stage (in contrast to setup recovery, which recreates the pipeline with a fresh node):

#1: incorrectNodes(B, N)    :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
#2: expectedNodes(B, N)     :- getBlockPipe(B, N);
#3: DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N), writeStage(B, Stage), Stage == "DataTransfer";
#4: writeStage(B, "DataTr") :- writeStage(B, "Setup"), nodesCnt(Nc), acksCnt(Ac), Nc == Ac;
#5: nodesCnt(B, CNT<N>)     :- pipeNodes(B, N);
#6: pipeNodes(B, N)         :- getBlockPipe(B, N);
#7: acksCnt(B, CNT<A>)      :- setupAcks(B, P, "OK");
#8: setupAcks(B, P, A)      :- setupAck(B, P, A);
Ex: which nodes actually store the valid blocks? actualNodes(Block, Node) is deduced from disk I/O events in 8 Datalog rules; after the crash it holds only B, Node1.

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
Recovery ≠ invariant: while recovery is ongoing, invariants are violated, and we don't want false alarms. So the check needs precise timing, e.g. upon block completion:

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N), completeBlock(B);
The full write-recovery specification, annotated by role (expectation, failure events, I/O events, actual facts):

r1  incorrectNodes(B, N)       :- cnpComplete(B), expectedNodes(B, N), NOT-IN actualNodes(B, N);   (expectation)
r2  pipeNodes(B, Pos, N)       :- getBlkPipe(UFile, B, Gs, Pos, N);
r3  expectedNodes(B, N)        :- getBlkPipe(UFile, B, Gs, Pos, N);
r4  DEL expectedNodes(B, N)    :- fateCrashNode(N), pipeStage(B, Stg), Stg == 2, expectedNodes(B, N);   (failure event: crash)
r5  setupAcks(B, Pos, Ack)     :- cdpSetupAck(B, Pos, Ack);
r6  goodAcksCnt(B, COUNT<Ack>) :- setupAcks(B, Pos, Ack), Ack == 'OK';
r7  nodesCnt(B, COUNT<Node>)   :- pipeNodes(B, Pos, N);
r8  pipeStage(B, Stg)          :- nodesCnt(NCnt), goodAcksCnt(ACnt), NCnt == ACnt, Stg := 2;
r9  blkGenStamp(B, Gs)         :- dnpNextGenStamp(B, Gs);
r10 blkGenStamp(B, Gs)         :- cnpGetBlkPipe(UFile, B, Gs, _, _);
r11 diskFiles(N, File)         :- fsCreate(N, File);   (I/O events)
r12 diskFiles(N, Dst)          :- fsRename(N, Src, Dst), diskFiles(N, Src);
r13 DEL diskFiles(N, Src)      :- fsRename(N, Src, Dst), diskFiles(N, Src);
r14 fileTypes(N, File, Type)   :- diskFiles(N, File), Type := Util.getType(File);
r15 blkMetas(N, B, Gs)         :- fileTypes(N, File, Type), Type == metafile, Gs := Util.getGs(File);
r16 actualNodes(B, N)          :- blkMetas(N, B, Gs), blkGenStamp(B, Gs);   (actual facts)
[Diagram: two runs of the write pipeline with crashes; block B ends up on fewer nodes than expected]
The spec only says "something is wrong". Why? Where is the bug? Let's write more detailed specs.
[Diagram: write pipeline; Node3 crashes, and the client's pipeline ends up excluding Node2]
First analysis: the client's pipeline excludes Node2. Why? Maybe the client gets a bad ack for Node2:

errBadAck(N) :- dataAck(N, "Error"), liveNodes(N);

Second analysis: the client gets a bad ack for Node2. Why? Maybe Node1 could not communicate with Node2:

errBadConnect(N, TgtN) :- dataTransfer(N, TgtN, "Terminated"), liveNodes(TgtN);

We catch the bug! Node2 cannot talk to Node3 (which crashed), so Node2 terminates all of its connections (including the one to Node1!), and Node1 concludes that Node2 is dead.
... nodes are accessible from the setup reply from the namenode to the client (a2). However, if there is a crash, the expectation changes: the crashed node should be removed from the expected nodes (a4). This implies that an expectation is also based on failure events.
- Failure events: Failures in different stages result in different recovery behaviors. Thus, we must know precisely when failures occur. For data-transfer recovery, we need to capture the current stage of the write process and only change the expectation if a crash occurs within the data-transfer stage (fateCrashNode happens at Stg == 2 in rule a4). The data transfer stage is deduced in rules a5-a8: the second stage begins after all acks from the setup phase have been received.
Before moving on, we emphasize two important observations here. First, this example shows how FATE and DESTINI must work hand in hand. That is, recovery specifications require a failure service to exercise them, and a failure service requires specifications of expected failure handling. Second, with logic programming, developers can easily build expectations only from events.
- Facts: The fact (actualNodes) is also built from events (a9-a16), more specifically, by tracking the locations of valid replicas. A valid replica can be tracked with two pieces of information: the block's latest generation time stamp, which DESTINI tracks by interposing two interfaces (a9 and a10), and meta/checksum files with the ...
More detailed specs catch bugs closer to the source and earlier in time.
Time, events, and errors: a timeline of a DESTINI execution (Table 4).

t1: The client asks the namenode for a block ID and the nodes.
    cnpGetBlkPipe(usrFile, blk_x, gs1, 1, N1);
    cnpGetBlkPipe(usrFile, blk_x, gs1, 2, N2);
    cnpGetBlkPipe(usrFile, blk_x, gs1, 3, N3);
t2: The setup stage begins (the pipeline nodes set up the files).
    fsCreate(N1, tmp/blk_x_gs1.meta);
    fsCreate(N2, tmp/blk_x_gs1.meta);
    fsCreate(N3, tmp/blk_x_gs1.meta);
t3: The client receives the setup acks. Data transfer begins.
    cdpSetupAck(blk_x, 1, OK);
    cdpSetupAck(blk_x, 2, OK);
    cdpSetupAck(blk_x, 3, OK);
t4: FATE crashes N3. Got error (b4), from the datanode's view.
    fateCrashNode(N3);
    errBadConnect(N1, N2);  // should be good
t5: The client receives an erroneous ack. Got error (b1), from the client's view.
    cdpDataAck(2, Error);
    errBadAck(2, N2);  // should be good
t6: Recovery begins. Get a new generation time stamp.
    dnpNextGenStamp(blk_x, gs2);
t7: Only N1 continues and finalizes the files.
    fsCreate(N1, tmp/blk_x_gs2.meta);
    fsRename(N1, tmp/blk_x_gs2.meta, current/blk_x_gs2.meta);
t8: The client marks completion. Got error (a1), from the global view.
    cnpComplete(blk_x);
    errDataRec(blk_x, N2);  // should exist
Design patterns:
- Add detailed specs
- Refine existing specs
- Write specs from different views (global, client, datanode)
- Incorporate diverse failures (crashes, network partitions)
- Express different violations (data loss, unavailability)
Outline: Motivation; FATE; DESTINI; Evaluation.
Implementation complexity: ~6000 LOC in Java, targeting 3 popular cloud systems:
- HadoopFS (primary target): the underlying storage for Hadoop/MapReduce
- ZooKeeper: a distributed synchronization service
- Cassandra: a distributed key-value store
Recovery bugs: found 22 new, confirmed HDFS bugs (data loss and unavailability bugs) and reproduced 51 old bugs.
"If multiple racks are available (reachable), a block should be stored in a minimum of two racks." In other words: "Throw a violation if a block is stored in only one rack, but that rack is connected to another rack":

errorSingleRack(B) :- rackCnt(B, Cnt), Cnt == 1, blkRacks(B, R), connected(R, Rb), endOfReplicationMonitor(_);

[Diagram: a client writes B to Rack #1; FATE injects rack partitioning while the replication monitor runs]
The facts rackCnt(B, 1), blkRacks(B, R1), and connected(R1, R2) derive errorSingleRack(B). An availability bug: #replicas = 3, but the locations are not checked, so B is never migrated to Rack #2.
Pruning reduces the number of experiments by an order of magnitude, each experiment takes 4-9 seconds, and (by experience) it found the same number of bugs.
[Chart: #experiments, brute force vs. pruned, for four workloads (Write + 2 crashes, Append + 2 crashes, Write + 3 crashes, Append + 3 crashes); visible values include 7720, 5000, 61, and 8]
Compared to other related work:

Framework                  #Chks   Lines/Chk
D3S [NSDI '08]             10      53
Pip [NSDI '06]             44      43
WiDS [NSDI '07]            15      22
P2 Monitor [EuroSys '06]   11      12
DESTINI                    74      5
Cloud systems are all good, but they must manage failures: performance, reliability, and availability all depend on failure recovery. FATE and DESTINI explore multiple, diverse failures systematically and facilitate declarative recovery specifications, in a unified framework. Real-world adoption is in progress.
Thanks! Questions?