Haryadi Gunawi
UC Berkeley
• Research background
• FATE and DESTINI
[Figure: the evolution of storage, from local storage (e.g., a laptop) to storage servers (e.g., Google) to cloud storage]
• Internet-service file systems: GoogleFS, HadoopFS, CloudStore, ...
• Key-value stores (+ replication): Cassandra, Voldemort, ...
• Structured storage (+ scale-up): Yahoo! PNUTS, Google BigTable, HBase, ...
• Custom storage (+ migration, + ...): Facebook Haystack Photo Store, Microsoft StarTrack (Map Apps), Amazon S3, EBS, ...

"This is not just data. It's my life. And I would be sick if I lost it." [CNN '10]
Cloudy with a chance of failure.
• Research background
• FATE and DESTINI
  - Motivation
  - FATE
  - DESTINI
  - Evaluation

Joint work with: Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph Hellerstein, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Koushik Sen, and Dhruba Borthakur.
• Cloud
  - Thousands of commodity machines
  - "Rare (HW) failures become frequent" [Hamilton]
• Failure recovery
  - "... has to come from the software" [Dean]
  - "... must be a first-class op" [Ramakrishnan et al.]
• Google Chubby (lock service)
  - Four occasions of data loss
  - Whole system down
  - More problems after injecting multiple failures (randomly)
• Google BigTable (key-value store)
  - A Chubby bug affects BigTable availability
• More details?
  - How often? Other problems and implications?
• Open-source cloud projects
  - Very valuable bug/issue repositories!
  - HDFS (1400+), ZooKeeper (900+), Cassandra (1600+), HBase (3000+), Hadoop (7000+)
• HDFS JIRA study
  - 1300 issues over 4 years (April 2006 to July 2010)
  - Selected recovery problems due to hardware failures
  - 91 recovery bugs/issues

  Implication       Count
  Data loss         13
  Unavailability    48
  Corruption        19
  Misc.             10
• Testing is not advanced enough [Google]
  - Failure model: multiple, diverse failures
• Recovery is under-specified [Hamilton]
  - Lots of custom recovery
  - Implementation is complex
• Need two advancements:
  - Exercise complex failure modes
  - Write specifications and test the implementation
[Figure: (1) FATE, the Failure Testing Service, injects failures (X) into the cloud software; (2) DESTINI, the Declarative Testing Specifications, checks whether any specs are violated]
• Research background
• FATE and DESTINI
  - Motivation
  - FATE
    - Architecture
    - Failure exploration
  - DESTINI
  - Evaluation
HadoopFS (HDFS) write protocol:
[Figure: the client C asks the master M to allocate a block (Alloc Req), sets up a pipeline of datanodes 1-2-3 (Setup Stage), and streams data through the pipeline (Data Transfer). Setup recovery: recreate a fresh pipeline (e.g., nodes 1, 2, 4). Data-transfer recovery: continue on the surviving nodes (e.g., nodes 1, 2).]
• Failures
  - Anytime: different stages → different recovery
  - Anywhere: N2 crashes, and then N3
  - Any type: bad disks, partitioned nodes/racks
• FATE
  - Systematically exercise multiple, diverse failures
  - How? "Remember" failures via failure IDs
• Abstraction of I/O failures
• Building failure IDs (a sketch follows below)
  - Intercept every I/O
  - Inject possible failures, e.g., crash, network partition, disk failure (LSE/corruption)
  - Note: FIDs are labeled A, B, C, ...

Example failure ID (ID 25):
  - I/O information: <stack trace>; OutputStream.read() in BlockReceiver.java; net I/O from N3 to N2; "Data Ack"
  - Failure: crash after the I/O
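To make the failure-ID abstraction concrete, here is a minimal Java sketch of how a failure ID could be derived from the intercepted I/O information; the class and field names are illustrative, not FATE's actual code.

import java.util.Arrays;
import java.util.Objects;

// Illustrative sketch: a failure ID captures the static I/O context plus the
// failure type injected at that point. Two I/O points with the same context and
// failure type map to the same ID, which is how FATE can "remember" which
// failures it has already exercised (the numeric ID on the slide, e.g. 25, can
// be thought of as an index assigned to each distinct FailureId).
public class FailureId {
    public enum FailureType { CRASH_BEFORE, CRASH_AFTER, NETWORK_PARTITION, DISK_FAILURE }

    private final String[] stackTrace;  // e.g., OutputStream.read() in BlockReceiver.java
    private final String sourceNode;    // e.g., "N3"
    private final String targetNode;    // e.g., "N2"
    private final String messageType;   // e.g., "Data Ack"
    private final FailureType failure;  // e.g., CRASH_AFTER

    public FailureId(String[] stackTrace, String sourceNode, String targetNode,
                     String messageType, FailureType failure) {
        this.stackTrace = stackTrace.clone();
        this.sourceNode = sourceNode;
        this.targetNode = targetNode;
        this.messageType = messageType;
        this.failure = failure;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FailureId)) return false;
        FailureId other = (FailureId) o;
        return Arrays.equals(stackTrace, other.stackTrace)
                && sourceNode.equals(other.sourceNode)
                && targetNode.equals(other.targetNode)
                && messageType.equals(other.messageType)
                && failure == other.failure;
    }

    @Override
    public int hashCode() {
        return Objects.hash(Arrays.hashCode(stackTrace), sourceNode, targetNode, messageType, failure);
    }
}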
FATE architecture:
- Workload driver: while (new FIDs) { hdfs.write() } (sketched below)
- Target system (e.g., Hadoop FS), instrumented with an AspectJ failure surface on top of the Java SDK
- The failure surface sends I/O info to the failure server, which decides: fail or no fail?
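A rough Java sketch of the workload-driver loop above; FailureServer, HdfsWorkload, and their methods are hypothetical stand-ins for FATE's actual components.

// Hypothetical driver loop for "while (new FIDs) { hdfs.write() }": keep
// re-running the workload as long as the failure server still has unexercised
// failure IDs to inject. FailureServer and HdfsWorkload are assumed helper
// classes, not real FATE or Hadoop APIs.
public class WorkloadDriver {
    public static void main(String[] args) throws Exception {
        FailureServer server = new FailureServer("localhost", 9090); // assumed
        HdfsWorkload workload = new HdfsWorkload();                  // wraps hdfs.write()
        while (server.hasNewFailureIds()) {   // "while (new FIDs)"
            server.startExperiment();         // decide which failure ID(s) to inject this run
            workload.write();                 // exercise the HDFS write protocol
            server.endExperiment();           // record the failure IDs seen/injected
        }
    }
}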
[Figure: brute-force exploration. With one failure per run, experiments #1, #2, and #3 inject failure IDs A, B, and C, respectively. With two failures per run, FATE enumerates combinations such as AB, AC, and BC across the pipeline nodes.]
• Research background
• FATE and DESTINI
  - Motivation
  - FATE
    - Architecture
    - Failure exploration: challenge and solution
  - DESTINI
  - Evaluation
• Exercised over 40,000 unique combinations of one, two, and three failures per run
  - 80 hours of testing time!
• New challenge: combinatorial explosion
  - Need smarter exploration strategies

[Figure: with two failures per run, failure IDs A and B across nodes 1-3 already yield combinations A1A2, A1B2, B1A2, B1B2, ...]
• Properties of multiple failures
  - Pairwise dependent failures
  - Pairwise independent failures
• Goal: exercise distinct recovery behaviors
  - Key: some failures result in similar recovery
  - Result: > 10x faster, and found the same bugs
• Failure dependency graph
  - Inject single failures first
  - Record the subsequent dependent FIDs (e.g., X depends on A)
  - FID → subsequent FIDs: A → X; B → X; C → X, Y; D → X, Y
  - Brute force: AX, BX, CX, DX, CY, DY
• Recovery clustering
  - Key: some failures lead to the same recovery; here there are two clusters, {X} and {X, Y}
  - Only exercise distinct clusters: pick one failure ID that triggers each recovery cluster
  - Result: AX, CX, CY (see the sketch below)
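A small Java sketch of this pruning idea, assuming each first failure ID has already been mapped (from single-failure runs) to the set of subsequent FIDs it enables; the data structures are illustrative.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative recovery clustering for dependent failure pairs: group first
// failures by the set of subsequent FIDs they enable (a proxy for "same
// recovery"), then exercise one representative per cluster instead of every pair.
public class DependentPairPruning {
    public static List<String[]> prune(Map<String, Set<String>> dependents) {
        // cluster key = set of subsequent FIDs; value = one representative first FID
        Map<Set<String>, String> representative = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : dependents.entrySet()) {
            representative.putIfAbsent(e.getValue(), e.getKey());
        }
        List<String[]> experiments = new ArrayList<>();
        for (Map.Entry<Set<String>, String> e : representative.entrySet()) {
            for (String second : e.getKey()) {
                experiments.add(new String[] { e.getValue(), second });
            }
        }
        return experiments;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> deps = new LinkedHashMap<>();
        deps.put("A", Set.of("X"));
        deps.put("B", Set.of("X"));
        deps.put("C", Set.of("X", "Y"));
        deps.put("D", Set.of("X", "Y"));
        // Brute force: AX, BX, CX, DX, CY, DY; pruned: AX, CX, CY.
        for (String[] p : prune(deps)) System.out.println(p[0] + p[1]);
    }
}

On the slide's example, this keeps one representative per recovery cluster ({X} and {X, Y}) and prints AX, CX, and CY (order may vary).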
• Independent combinations
  - FP² × N(N − 1) experiments
  - E.g., FP = 2 failure points and N = 3 nodes give 2² × 3 × 2 = 24
• Symmetric code
  - Just pick two nodes: N(N − 1) → 2, so the total drops to FP² × 2 (= 8 in this example)

[Figure: failure points A and B injected on pairs of nodes 1-3]
• FP² bottleneck
  - E.g., FP = 4 already gives 16 combinations
  - Real example: FP = 15
• Recovery clustering
  - Cluster failure points A and B if fail(A) == fail(B)
  - Reduces FP² to FP²-clustered
  - E.g., 15 FPs reduce to 8 clustered FPs

[Figure: failure points A-D on two nodes, before and after clustering]
• Exercise multiple, diverse failures
  - Via the failure-ID abstraction
  - Challenge: combinatorial explosion of multiple failures
• Smart failure exploration strategies
  - > 10x improvement
  - Built on top of the failure-ID abstraction
• Current limitations
  - No I/O reordering
  - Manual workload setup
• Research background
• FATE and DESTINI
  - Motivation
  - FATE
  - DESTINI
    - Overview
    - Building specifications
  - Evaluation
• Is the system correct under failures?
• Need to write specifications

"[It is] great to document (in a spec) the HDFS write protocol ..., but we shouldn't spend too much time on it ... a formal spec may be overkill for a protocol we plan to deprecate imminently."

[Figure: specs checked against the implementation while failures (X) are injected]
• How to write specifications?
  - Developer friendly (clear, concise, easy)
  - Existing approaches:
    - Unit tests (ugly, bloated, not formal)
    - Others: too verbose and long
• Declarative relational logic language (Datalog)
  - Key: easy to express logical relations
• How to write specs?
  - Violations
  - Expectations
  - Facts
• How to write recovery specs?
  - "... recovery is under-specified" [Hamilton]
  - Precise failure events
  - Precise check timings
• How to test the implementation? (see the aspect sketch below)
  - Interpose I/O calls (lightweight)
  - Deduce expectations and facts from I/O events
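Since the architecture interposes on I/O through an AspectJ failure surface, here is a minimal annotation-style aspect sketch; the pointcut, the intercepted call, and the FailureServerClient helper are assumptions for illustration, not the project's actual instrumentation.

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// Illustrative failure surface: interpose on a write call, report the I/O
// context to the failure server, and either inject a failure or let the call
// proceed. The pointcut and FailureServerClient are assumptions.
@Aspect
public class FailureSurface {
    private final FailureServerClient server = new FailureServerClient(); // assumed helper

    // Hypothetical pointcut: every OutputStream.write() call made from Hadoop code.
    @Around("call(* java.io.OutputStream.write(..)) && within(org.apache.hadoop..*)")
    public Object aroundWrite(ProceedingJoinPoint jp) throws Throwable {
        // The join point's source location and signature form the failure-ID context.
        String ioInfo = jp.getSourceLocation() + " " + jp.getSignature();
        if (server.shouldFailBefore(ioInfo)) {
            throw new java.io.IOException("FATE-injected failure before I/O");
        }
        Object result = jp.proceed();          // perform the real I/O
        if (server.shouldFailAfter(ioInfo)) {  // e.g., "crash after" the Data Ack
            throw new java.io.IOException("FATE-injected failure after I/O");
        }
        return result;
    }
}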
• Research background
• FATE and DESTINI
  - Motivation
  - FATE
  - DESTINI
    - Overview
    - Building specifications
  - Evaluation
"Throw a violation if an expectation is different from the actual behavior."

violationTable(…) :- expectationTable(…), NOT-IN actualTable(…)

Datalog syntax: head() :- predicates(), …
  ":-" denotes derivation; "," denotes AND.
"Replicas should exist in surviving nodes."
[Figure: the client writes block B through the pipeline 1-2-3; node 3 crashes (X) during data transfer]

expectedNodes(Block, Node): (B, Node1), (B, Node2)
actualNodes(Block, Node):   (B, Node1), (B, Node2)
incorrectNodes(Block, Node): empty, so no violation

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
Same scenario, but the recovery misbehaves:

expectedNodes(Block, Node): (B, Node1), (B, Node2)
actualNodes(Block, Node):   (B, Node1)
incorrectNodes(Block, Node): (B, Node2), a violation

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
• Ex: which nodes should have the blocks?
• Deduce expectations from I/O events
[Figure: the client asks the master for three nodes for block B (getBlockPipe) and receives [Node1, Node2, Node3]]

expectedNodes(Block, Node): (B, Node1), (B, Node2), (B, Node3)

#2: expectedNodes(B, N) :- getBlockPipe(B, N);
#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
If FATE crashes a node, that node should be removed from the expectation:

DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N);

[Figure: node 3 crashes, so (B, Node3) is deleted from expectedNodes, leaving (B, Node1) and (B, Node2)]

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
#2: expectedNodes(B, N) :- getBlockPipe(B, N);
Different stages lead to different recovery behaviors (data-transfer recovery on nodes 1-2-3 vs. setup recovery with a fresh node 4), so the deletion rule needs precise failure events: the expectation changes only if the crash occurs during the data-transfer stage (an operational sketch follows the rules below).

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
#2: expectedNodes(B, N) :- getBlockPipe(B, N);
#3: DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N), writeStage(B, Stage), Stage == "DataTransfer";
#4: writeStage(B, "DataTransfer") :- writeStage(B, "Setup"), nodesCnt(Nc), acksCnt(Ac), Nc == Ac;
#5: nodesCnt(B, CNT<N>) :- pipeNodes(B, N);
#6: pipeNodes(B, N) :- getBlockPipe(B, N);
#7: acksCnt(B, CNT<A>) :- setupAcks(B, P, "OK");
#8: setupAcks(B, P, A) :- setupAck(B, P, A);
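To show how rules #2-#8 behave operationally, here is an illustrative Java sketch that consumes a stream of I/O and failure events; the class and handler names are assumptions, not DESTINI's actual Datalog evaluation.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Illustrative re-implementation of rules #2-#8: build expectedNodes from
// getBlockPipe events, track the write stage from setup acks, and delete an
// expected node only if FATE crashes it during the data-transfer stage.
public class ExpectationTracker {
    private final Map<String, Set<String>> expectedNodes = new HashMap<>(); // block -> nodes
    private final Map<String, String> writeStage = new HashMap<>();         // block -> stage
    private final Map<String, Integer> pipeNodeCnt = new HashMap<>();
    private final Map<String, Integer> goodAckCnt = new HashMap<>();

    // Rules #2, #5, #6: getBlockPipe adds an expected node and counts pipeline nodes.
    public void onGetBlockPipe(String block, String node) {
        expectedNodes.computeIfAbsent(block, b -> new HashSet<>()).add(node);
        pipeNodeCnt.merge(block, 1, Integer::sum);
        writeStage.putIfAbsent(block, "Setup");
    }

    // Rules #4, #7, #8: once all setup acks are OK, the stage becomes DataTransfer.
    public void onSetupAck(String block, String ack) {
        if ("OK".equals(ack)) goodAckCnt.merge(block, 1, Integer::sum);
        if (Objects.equals(goodAckCnt.get(block), pipeNodeCnt.get(block))) {
            writeStage.put(block, "DataTransfer");
        }
    }

    // Rule #3: a crash changes the expectation only during data transfer.
    public void onFateCrashNode(String node) {
        for (Map.Entry<String, Set<String>> e : expectedNodes.entrySet()) {
            if ("DataTransfer".equals(writeStage.get(e.getKey()))) {
                e.getValue().remove(node);
            }
        }
    }

    public Set<String> getExpectedNodes(String block) {
        return expectedNodes.getOrDefault(block, Set.of());
    }
}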
• Ex: which nodes actually store valid blocks?
  - Deduced from disk I/O events in 8 Datalog rules
[Figure: node 3 has crashed; only Node1 ends up holding a valid replica of B]

actualNodes(Block, Node): (B, Node1)

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
• Recovery ≠ invariant
  - If recovery is ongoing, invariants are (temporarily) violated
  - We don't want false alarms
• Need precise check timings
  - Ex: check upon block completion (sketched below)

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N), completeBlock(B);
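A tiny Java sketch of this check-timing discipline: the violation check runs only when the block-completion event arrives, never while recovery is still in flight; the names are again illustrative.

import java.util.HashSet;
import java.util.Set;

// Illustrative check timing: evaluate rule #1 only upon the block-completion
// event, so a recovery that is still in progress does not raise a false alarm.
public class ViolationChecker {
    // Called only when the completeBlock event (cnpComplete) is observed.
    public static Set<String> onCompleteBlock(String block,
                                              Set<String> expectedNodes,
                                              Set<String> actualNodes) {
        Set<String> incorrect = new HashSet<>(expectedNodes);
        incorrect.removeAll(actualNodes);   // expectedNodes NOT-IN actualNodes
        if (!incorrect.isEmpty()) {
            System.out.println("incorrectNodes(" + block + ", " + incorrect + ")");
        }
        return incorrect;
    }
}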
The complete data-loss specification for the write protocol. Rules r1-r3 express the expectation, r4 incorporates the failure event (a crash), r5-r8 deduce the write stage, and r9-r16 derive the actual facts from I/O events (the fact-tracking side is sketched after the list):

r1:  incorrectNodes(B, N)       :- cnpComplete(B), expectedNodes(B, N), NOT-IN actualNodes(B, N);
r2:  pipeNodes(B, Pos, N)       :- getBlkPipe(UFile, B, Gs, Pos, N);
r3:  expectedNodes(B, N)        :- getBlkPipe(UFile, B, Gs, Pos, N);
r4:  DEL expectedNodes(B, N)    :- fateCrashNode(N), pipeStage(B, Stg), Stg == 2, expectedNodes(B, N);
r5:  setupAcks(B, Pos, Ack)     :- cdpSetupAck(B, Pos, Ack);
r6:  goodAcksCnt(B, COUNT<Ack>) :- setupAcks(B, Pos, Ack), Ack == 'OK';
r7:  nodesCnt(B, COUNT<Node>)   :- pipeNodes(B, _, N, _);
r8:  pipeStage(B, Stg)          :- nodesCnt(NCnt), goodAcksCnt(ACnt), NCnt == ACnt, Stg := 2;
r9:  blkGenStamp(B, Gs)         :- dnpNextGenStamp(B, Gs);
r10: blkGenStamp(B, Gs)         :- cnpGetBlkPipe(UFile, B, Gs, _, _);
r11: diskFiles(N, File)         :- fsCreate(N, File);
r12: diskFiles(N, Dst)          :- fsRename(N, Src, Dst), diskFiles(N, Src, Type);
r13: DEL diskFiles(N, Src)      :- fsRename(N, Src, Dst), diskFiles(N, Src, Type);
r14: fileTypes(N, File, Type)   :- diskFiles(N, File), Type := Util.getType(File);
r15: blkMetas(N, B, Gs)         :- fileTypes(N, File, Type), Type == metafile, Gs := Util.getGs(File);
r16: actualNodes(B, N)          :- blkMetas(N, B, Gs), blkGenStamp(B, Gs);
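For the actual-facts side (rules r9-r16), here is an illustrative Java sketch of how valid replica locations could be tracked from disk I/O events; the meta-file naming and the helpers are assumptions standing in for the slide's Util.getType and Util.getGs.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative fact tracking in the spirit of r9-r16: follow fsCreate/fsRename
// events per node, and report a node as actually holding block B only if it has
// a finalized meta file whose generation stamp matches the block's latest stamp.
public class ActualNodesTracker {
    private final Map<String, Set<String>> diskFiles = new HashMap<>(); // node -> files
    private final Map<String, String> blkGenStamp = new HashMap<>();    // block -> latest gs

    public void onFsCreate(String node, String file) {                 // r11
        diskFiles.computeIfAbsent(node, n -> new HashSet<>()).add(file);
    }

    public void onFsRename(String node, String src, String dst) {      // r12, r13
        Set<String> files = diskFiles.computeIfAbsent(node, n -> new HashSet<>());
        if (files.remove(src)) files.add(dst);
    }

    public void onGenStamp(String block, String gs) {                  // r9, r10
        blkGenStamp.put(block, gs);
    }

    // r14-r16: a node is an actual replica holder if it has a finalized meta file,
    // assumed here to be named "current/<block>_<gs>.meta", with the latest stamp.
    public boolean isActualNode(String block, String node) {
        String gs = blkGenStamp.get(block);
        if (gs == null) return false;
        String expectedMeta = "current/" + block + "_" + gs + ".meta";  // assumed naming
        return diskFiles.getOrDefault(node, Set.of()).contains(expectedMeta);
    }
}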
[Figure: a write of block B in which node 3 crashes, and then node 2; recovery continues on node 1 only]

• The spec only says "something is wrong"
  - Why? Where is the bug?
• Let's write more detailed specs
• First analysis
  - The client's pipeline excludes Node2. Why?
  - Maybe the client gets a bad ack for Node2:
    errBadAck(N) :- dataAck(N, "Error"), liveNodes(N)
• Second analysis
  - The client gets a bad ack for Node2. Why?
  - Maybe Node1 could not communicate with Node2 (both checks are sketched below):
    errBadConnect(N, TgtN) :- dataTransfer(N, TgtN, "Terminated"), liveNodes(TgtN)
• We catch the bug!
  - Node2 cannot talk to Node3 (which crashed)
  - Node2 terminates all of its connections (including the one to Node1!)
  - So Node1 thinks Node2 is dead
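As an illustration of what the two detailed checks amount to, here is a hedged Java sketch; the liveNodes set and the event fields are assumptions standing in for DESTINI's tables.

import java.util.Set;

// Illustrative versions of the two detailed error specs: both flag a suspicious
// event that involves a node DESTINI still believes to be alive.
public class DetailedChecks {
    // errBadAck(N) :- dataAck(N, "Error"), liveNodes(N)
    public static boolean errBadAck(String node, String ackStatus, Set<String> liveNodes) {
        return "Error".equals(ackStatus) && liveNodes.contains(node);
    }

    // errBadConnect(N, TgtN) :- dataTransfer(N, TgtN, "Terminated"), liveNodes(TgtN)
    public static boolean errBadConnect(String node, String targetNode,
                                        String transferStatus, Set<String> liveNodes) {
        return "Terminated".equals(transferStatus) && liveNodes.contains(targetNode);
    }
}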
... nodes are accessible from the setup reply from the namenode to the client (a2). However, if there is a crash, the expectation changes: the crashed node should be removed from the expected nodes (a4). This implies that an expectation is also based on failure events.

• Failure events: Failures in different stages result in different recovery behaviors. Thus, we must know precisely when failures occur. For data-transfer recovery, we need to capture the current stage of the write process and only change the expectation if a crash occurs within the data-transfer stage (fateCrashNode happens at Stg==2 in rule a4). The data-transfer stage is deduced in rules a5-a8: the second stage begins after all acks from the setup phase have been received.

Before moving on, we emphasize two important observations here. First, this example shows how FATE and DESTINI must work hand in hand. That is, recovery specifications require a failure service to exercise them, and a failure service requires specifications of expected failure handling. Second, with logic programming, developers can easily build expectations only from events.

• Facts: The fact (actualNodes) is also built from events (a9-a16), more specifically, by tracking the locations of valid replicas. A valid replica can be tracked with two pieces of information: the block's latest generation time stamp, which DESTINI tracks by interposing two interfaces (a9 and a10), and meta/checksum files with the ...
• More detailed specs
• Catch bugs closer to the source and earlier in time

Table 4: A timeline of DESTINI execution (time, events, and errors).
t1: Client asks the namenode for a block ID and the nodes.
    cnpGetBlkPipe(usrFile, blk_x, gs1, 1, N1);
    cnpGetBlkPipe(usrFile, blk_x, gs1, 2, N2);
    cnpGetBlkPipe(usrFile, blk_x, gs1, 3, N3);
t2: Setup stage begins (the pipeline nodes set up the files).
    fsCreate(N1, tmp/blk_x_gs1.meta);
    fsCreate(N2, tmp/blk_x_gs1.meta);
    fsCreate(N3, tmp/blk_x_gs1.meta);
t3: Client receives setup acks. Data transfer begins.
    cdpSetupAck(blk_x, 1, OK);
    cdpSetupAck(blk_x, 2, OK);
    cdpSetupAck(blk_x, 3, OK);
t4: FATE crashes N3. Got error (b4), from the datanode's view.
    fateCrashNode(N3);
    errBadConnect(N1, N2); // should be good
t5: Client receives an erroneous ack. Got error (b1), from the client's view.
    cdpDataAck(2, Error);
    errBadAck(2, N2); // should be good
t6: Recovery begins. Get a new generation time stamp.
    dnpNextGenStamp(blk_x, gs2);
t7: Only N1 continues and finalizes the files.
    fsCreate(N1, tmp/blk_x_gs2.meta);
    fsRename(N1, tmp/blk_x_gs2.meta, current/blk_x_gs2.meta);
t8: Client marks completion. Got error (a1), from the global view.
    cnpComplete(blk_x);
    errDataRec(blk_x, N2); // should exist
• Design patterns
  - Add detailed specs
  - Refine existing specs
  - Write specs from different views (global, client, datanode)
  - Incorporate diverse failures (crashes, network partitions)
  - Express different violations (data loss, unavailability)
• Research background
• FATE and DESTINI
  - Motivation
  - FATE
  - DESTINI
  - Evaluation
• Implementation complexity
  - ~6,000 LOC in Java
• Three popular target cloud systems
  - HadoopFS (the primary target): the underlying storage for Hadoop/MapReduce
  - ZooKeeper: a distributed synchronization service
  - Cassandra: a distributed key-value store
• Recovery bugs
  - Found 22 new HDFS bugs (confirmed): data-loss and unavailability bugs
  - Reproduced 51 old bugs
"If multiple racks are available (reachable), a block should be stored in a minimum of two racks."
"Throw a violation if a block is only stored in one rack, but that rack is connected to another rack."

The corresponding DESTINI rule (an illustrative rendering follows below):
errorSingleRack(B) :- rackCnt(B, Cnt), Cnt == 1, blkRacks(B, R), connected(R, Rb), endOfReplicationMonitor(_);

[Figure: FATE injects a rack partition while the client writes B, so all three replicas land in Rack #1 (rackCnt(B, 1), blkRacks(B, R1)) even though Rack #1 is connected to Rack #2 (connected(R1, R2)). The replication monitor only checks that #replicas = 3, not the locations, so B is never migrated to Rack #2: an availability bug.]
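An illustrative Java rendering of this rack-awareness check; the rack-topology maps are assumed inputs, not HDFS or DESTINI APIs.

import java.util.Map;
import java.util.Set;

// Illustrative version of errorSingleRack: at the end of a replication-monitor
// round, flag a block whose replicas all sit in one rack even though that rack
// is connected to at least one other rack.
public class RackCheck {
    public static boolean errorSingleRack(String block,
                                          Map<String, Set<String>> blkRacks,    // block -> racks holding it
                                          Map<String, Set<String>> connected) { // rack -> reachable racks
        Set<String> racks = blkRacks.getOrDefault(block, Set.of());
        if (racks.size() != 1) return false;                           // rackCnt(B, Cnt), Cnt == 1
        String onlyRack = racks.iterator().next();
        return !connected.getOrDefault(onlyRack, Set.of()).isEmpty();  // connected(R, Rb)
    }
}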
• Reduce the number of experiments by an order of magnitude
  - Each experiment takes 4-9 seconds
• Found the same number of bugs (by experience)

[Bar chart: number of experiments, brute force vs. pruned, for four workloads (Write + 2 crashes, Append + 2 crashes, Write + 3 crashes, Append + 3 crashes); labeled values include 7720, 5000, 61, and 8]
• Compared to other related work:

Framework                  #Chks   Lines/Chk
D3S [NSDI '08]             10      53
Pip [NSDI '06]             44      43
WiDS [NSDI '07]            15      22
P2 Monitor [EuroSys '06]   11      12
DESTINI                    74      5
• Cloud systems
  - All good, but they must manage failures
  - Performance, reliability, and availability depend on failure recovery
• FATE and DESTINI
  - Explore multiple, diverse failures systematically
  - Facilitate declarative recovery specifications
  - A unified framework
• Real-world adoption in progress
• Research background
• FATE and DESTINI

Thanks! Questions?