Pip: Detecting the Unexpected in Distributed Systems
Patrick Reynolds
reynolds@cs.duke.edu
Collaborators: Janet Wiener, Charles Killian, Jeffrey Mogul, Amin Vahdat, Mehul Shah
UCSD, HP Labs
http://issg.cs.duke.edu/pip/

Introduction
• Distributed systems are essential to the Internet
• Distributed systems are complex
  – Harder to debug than centralized systems
• Pip uses causal paths to find bugs
  – Characterize system behavior
  – Deviations from expected behavior often indicate bugs
• Experimental results show that Pip is effective
  – Found 19 bugs in 6 test systems
NSDI - May 8th, 2006
page 2
Challenges of debugging distributed systems
• Distributed systems have more bugs
  – Parallelism is hard
  – New sources of failure
    • More components = more faults
    • Network errors
    • Security breaches
• Finding bugs is harder
  – More nodes, events, and messages to keep track of
  – Applications may cross administrative domains
  – Bugs on one node may be caused by events on another
Causal path analysis
• Programmers wish to examine and check system-wide behaviors
  – Causal paths
  – Components of end-to-end delay
  – Attribution of resource consumption
• Unexpected behavior might indicate a bug
  – Pip compares actual behavior to expected behavior
[Figure: request path through Web server, App server, and Database; labels: 500ms, 2000 page faults]
What kinds of bugs?
• Structure bugs
  – Incorrect placement/timing of processing and communication
• Performance bugs
  – Throughput bottlenecks
  – Over- or under-consumption of resources
• Bugs present in an actual run of the system
  – Selecting input for path coverage is orthogonal
[Figure: request path through Web server, App server, and Database; labels: 500ms, 2000 page faults]
Three target audiences
• Primary programmer
  – Debugging or optimizing his/her own system
• Secondary programmer
  – Inheriting a project or joining a programming team
  – Discovery: learning how the system behaves
• Operator
  – Monitoring a running system for unexpected behavior
  – Performing regression tests after a change
Outline
• Introduction
• Pip
  – Expressing expected behavior
  – Exploring application behavior
  – Results
• Conclusions
Workflow
Pip:
1. Captures events from a running system
2. Reconstructs behavior from events
3. Checks behavior against expectations
4. Displays unexpected behavior
   – Both structure and resource violations
Goal: help programmers locate and explain bugs
[Figure: workflow diagram with boxes labeled Application, Behavior model, Expectations, Pip checker, Unexpected structure, Resource violations, and Pip explorer: visualization GUI]
Describing application behavior
• Application behavior consists of paths
  – All events, on any node, related to one high-level operation
  – The definition of a path is up to the programmer
  – A path is often causal, related to a user request
[Figure: timeline of one path. WWW: Parse HTTP, Send response; App srv: Run application; DB: Query; horizontal axis: time]
Describing application behavior
• Within paths are tasks, messages, and notices
  – Tasks: processing with start and end points
  – Messages: send and receive events for any communication
    • Includes network, synchronization (lock/unlock), and timers
  – Notices: time-stamped strings; essentially log entries
[Figure: timeline of one path. WWW: Parse HTTP, Send response; App srv: Run application; DB: Query; notices: "Request = /cgi/…", "2096 bytes in response", "done with request 12"; horizontal axis: time]
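One way to picture a path and the tasks, messages, and notices inside it is as a small set of records. The sketch below is only an illustration in Python: the class names, field names, and timestamps are assumptions made for this example and are not Pip's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Task:
    name: str
    host: str
    start: float  # seconds since the path began (hypothetical values)
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class Message:
    sender: str
    receiver: str
    send_time: float
    recv_time: float


@dataclass
class Notice:
    host: str
    time: float
    text: str  # essentially a log entry


@dataclass
class Path:
    path_id: int
    tasks: List[Task] = field(default_factory=list)
    messages: List[Message] = field(default_factory=list)
    notices: List[Notice] = field(default_factory=list)

    def hosts(self) -> Set[str]:
        """Every host that ran at least one task on this path."""
        return {t.host for t in self.tasks}


# A hypothetical path loosely following the timeline figure.
path = Path(path_id=12)
path.tasks.append(Task("Parse HTTP", "WWW", 0.000, 0.010))
path.tasks.append(Task("Run application", "App srv", 0.012, 0.050))
path.tasks.append(Task("Query", "DB", 0.020, 0.040))
path.tasks.append(Task("Send response", "WWW", 0.052, 0.060))
path.messages.append(Message("WWW", "App srv", 0.010, 0.012))
path.notices.append(Notice("WWW", 0.060, "done with request 12"))
```

Keeping all three event kinds under one path ID is what lets later stages reason about a request end to end, even when its events were recorded on different hosts.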
Expectations
• Pip has two kinds of expectations
• Recognizers classify paths into sets
• Aggregates assert properties of sets
[Figure: paths sorted into sets R1, R2, and R3, with a "?" marking an unclassified path]
Expectations: recognizers
• Application behavior consists of paths
• Each recognizer matches paths
  – A path can match more than one recognizer
• A recognizer can be a validator, an invalidator, or neither
• Any path matching zero validators or at least one invalidator is unexpected behavior: bug?

validator CGIRequest
  thread WebServer(*, 1)
    task("Parse HTTP") limit(CPU_TIME, 100ms);
    notice(m/Request URL: .*/);
    send(AppServer);
    recv(AppServer);

invalidator DatabaseError
  notice(m/Database error: .*/);
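The rule that a path is unexpected when it matches zero validators or at least one invalidator can be sketched in a few lines of Python. This is a simplification made for illustration: a path is reduced to its list of notice strings, and `has_notice`, `is_unexpected`, and the sample notices are hypothetical names and data, not part of Pip.

```python
import re
from typing import Callable, Dict, List

PathNotices = List[str]  # simplification: a path is just its notice strings
Recognizer = Callable[[PathNotices], bool]


def has_notice(pattern: str) -> Recognizer:
    """Build a recognizer that matches if any notice matches the pattern."""
    rx = re.compile(pattern)
    return lambda notices: any(rx.search(n) for n in notices)


validators: Dict[str, Recognizer] = {
    "CGIRequest": has_notice(r"Request URL: .*"),
}
invalidators: Dict[str, Recognizer] = {
    "DatabaseError": has_notice(r"Database error: .*"),
}


def is_unexpected(notices: PathNotices) -> bool:
    """Unexpected = matches no validator, or matches any invalidator."""
    matched_validators = [n for n, r in validators.items() if r(notices)]
    matched_invalidators = [n for n, r in invalidators.items() if r(notices)]
    return len(matched_validators) == 0 or len(matched_invalidators) > 0


# Hypothetical paths.
ok_path = ["Request URL: /example"]
db_error_path = ["Request URL: /example", "Database error: timeout"]
unknown_path = ["heartbeat"]
```

Here `ok_path` is expected, `db_error_path` is flagged because an invalidator matched, and `unknown_path` is flagged because no validator matched.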
Expectations: recognizer language
• repeat: matches n copies of a block, where a ≤ n ≤ b
    repeat between 1 and 3 { … }
• xor: matches any one of several blocks
    xor {
      branch: …
      branch: …
    }
• future: lets a block match now or later
  – done: forces the named block to match
    future F1 { … }
    …
    done(F1);
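To make the semantics of repeat and xor concrete, here is a small combinator-style matcher in Python. It is a sketch under stated assumptions: a path is flattened to a list of event names, every function name is hypothetical, `xor` is treated as plain alternation (it accepts when at least one branch matches), and `future`/`done` are not modeled. It is not how Pip's checker is implemented.

```python
from typing import Callable, List, Set

Events = List[str]
# A matcher takes the events and a start index and returns every index
# at which a successful match could end.
Matcher = Callable[[Events, int], Set[int]]


def event(name: str) -> Matcher:
    def m(ev: Events, i: int) -> Set[int]:
        return {i + 1} if i < len(ev) and ev[i] == name else set()
    return m


def seq(*parts: Matcher) -> Matcher:
    def m(ev: Events, i: int) -> Set[int]:
        ends = {i}
        for part in parts:
            ends = {e2 for e in ends for e2 in part(ev, e)}
        return ends
    return m


def repeat(lo: int, hi: int, block: Matcher) -> Matcher:
    """Match between lo and hi consecutive copies of block."""
    def m(ev: Events, i: int) -> Set[int]:
        ends, result = {i}, set()
        if lo == 0:
            result |= ends
        for n in range(1, hi + 1):
            ends = {e2 for e in ends for e2 in block(ev, e)}
            if n >= lo:
                result |= ends
        return result
    return m


def xor(*branches: Matcher) -> Matcher:
    """Accept if any branch matches (alternation; an assumption)."""
    def m(ev: Events, i: int) -> Set[int]:
        return set().union(*(b(ev, i) for b in branches))
    return m


def matches(m: Matcher, ev: Events) -> bool:
    """A recognizer matches only if it consumes the whole event list."""
    return len(ev) in m(ev, 0)


# Hypothetical recognizer: recv, then 1 to 3 queries, then one of two sends.
recognizer = seq(
    event("recv"),
    repeat(1, 3, event("query")),
    xor(event("send_ok"), event("send_error")),
)
```

With this recognizer, a path with two queries matches, while a path with zero or four queries does not.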
Expectations: aggregate expectations
• Recognizers categorize paths into sets
• Aggregates make assertions about sets of paths
  – Instances, unique instances, resource constraints
  – Simple math and set operators

assert(instances(CGIRequest) > 4);
assert(max(CPU_TIME, CGIRequest) < 500ms);
assert(max(REAL_TIME, CGIRequest) <= 3*avg(REAL_TIME, CGIRequest));
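The three assertions above can be mirrored in Python to show what an aggregate evaluates. This is a sketch, not Pip's checker: the helper names and the measurements are hypothetical, and times are in seconds.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical measurements for the paths that matched CGIRequest.
cgi_request_paths: List[Dict[str, float]] = [
    {"cpu_time": 0.120, "real_time": 0.300},
    {"cpu_time": 0.080, "real_time": 0.250},
    {"cpu_time": 0.200, "real_time": 0.400},
    {"cpu_time": 0.150, "real_time": 0.350},
    {"cpu_time": 0.100, "real_time": 2.500},  # one slow outlier
]


def instances(paths: List[Dict[str, float]]) -> int:
    return len(paths)


def agg_max(metric: str, paths: List[Dict[str, float]]) -> float:
    return max(p[metric] for p in paths)


def agg_avg(metric: str, paths: List[Dict[str, float]]) -> float:
    return mean(p[metric] for p in paths)


def check_aggregates(paths: List[Dict[str, float]]) -> Dict[str, bool]:
    """Evaluate each aggregate; False means the assertion is violated."""
    return {
        "instances > 4": instances(paths) > 4,
        "max CPU_TIME < 500ms": agg_max("cpu_time", paths) < 0.500,
        "max REAL_TIME <= 3*avg REAL_TIME":
            agg_max("real_time", paths) <= 3 * agg_avg("real_time", paths),
    }


results = check_aggregates(cgi_request_paths)
```

With this sample data the first two assertions hold and the third is violated, because the slowest path takes more than three times the average real time.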
Automating expectations and annotations
• Expectations can be generated from behavior model
  – Create a recognizer for each actual path
  – Eliminate repetition
  – Strike a balance between over- and under-specification
• Annotations can be generated by middleware
  – As in several of our test systems
[Figure: workflow diagram with boxes labeled Application, Annotations, Behavior model, Expectations, Pip checker, and Unexpected behavior]
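One plausible way to create a recognizer for each actual path while eliminating repetition is to run-length encode each path and merge paths that share a shape. The Python sketch below shows that heuristic only; it is an assumption made for illustration, not the algorithm Pip uses, and the function names are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Tuple

Shape = Tuple[str, ...]
Bounds = List[Tuple[int, int]]  # per-step (min, max) repeat counts


def collapse(events: List[str]) -> List[Tuple[str, int]]:
    """Run-length encode consecutive identical events."""
    return [(name, len(list(run))) for name, run in groupby(events)]


def generate(paths: List[List[str]]) -> Dict[Shape, Bounds]:
    """Map each distinct path shape to the repeat bounds seen so far."""
    recognizers: Dict[Shape, Bounds] = {}
    for events in paths:
        runs = collapse(events)
        shape = tuple(name for name, _ in runs)
        counts = [count for _, count in runs]
        if shape not in recognizers:
            recognizers[shape] = [(c, c) for c in counts]
        else:
            # Widen the bounds so the recognizer also covers this path.
            recognizers[shape] = [
                (min(lo, c), max(hi, c))
                for (lo, hi), c in zip(recognizers[shape], counts)
            ]
    return recognizers


# Hypothetical observed paths.
observed = [
    ["recv", "query", "send"],
    ["recv", "query", "query", "query", "send"],
    ["recv", "send"],
]
generated = generate(observed)
```

Merging too aggressively under-specifies (bugs slip through as valid), while keeping one recognizer per raw path over-specifies (harmless variation is flagged), which is the balance the slide refers to.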
Exploring behavior
• Expectations checker generates lists of valid and invalid paths
• Explore both sets
  – Why did invalid paths occur?
  – Is any unexpected behavior misclassified as valid?
    • Insufficiently constrained expectations
    • Pip may be unable to express all expectations
Visualization: causal paths
[Screenshot: Pip explorer, with three labeled areas]
• Causal view of path
• Caused tasks, messages, and notices on that thread
• Timing and resource properties for one task
Visualization: timelines
• View one path instance as a timeline
  – Different colors = different hosts
  – Each bar = task
  – Click a bar to see task details
Visualization: performance graphs
[Figure: plot of Delay (ms) against Time (s)]
• Plot per-task or per-path resource metrics
  – Cumulative distribution (CDF)
  – Probability density (PDF)
  – Changing values over time
• Click on a point to see its value and the task/path represented
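As a small illustration of the CDF view, the sketch below computes an empirical cumulative distribution over per-path delays in Python. The function name and the delay values are hypothetical; this shows the calculation only, not Pip's plotting code.

```python
from typing import List, Tuple


def empirical_cdf(values: List[float]) -> List[Tuple[float, float]]:
    """Return (value, fraction of samples <= value) in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


# Hypothetical per-path delays in milliseconds.
delays_ms = [12.0, 15.0, 11.0, 240.0, 14.0]
cdf = empirical_cdf(delays_ms)
```

A long flat tail in such a curve, like the single 240 ms sample here, is the kind of outlier that is worth clicking through to the underlying path.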
Results
• We have applied Pip to several distributed systems:
  – FAB: distributed block store
  – SplitStream: DHT-based multicast protocol
  – Others: RanSub, Bullet, SWORD, Oracle of Bacon
• We have found unexpected behavior in each system
• We have fixed bugs in some systems
  … and used Pip to verify that the behavior was fixed
Results: SplitStream
DHT-based multicast protocol
13 bugs found, 12 fixed
  – 11 found using expectations, 2 found using GUI
• Structural bug: some nodes have up to 25 children when they should have at most 18
  – This bug was fixed and later recurred
  – Root cause #1: variable shadowing
  – Root cause #2: failed to register a callback
• How discovered: first in the explorer GUI, confirmed with automated checking
Results: FAB
Distributed block store
1 bug found, fixed
  – Four protocols checked: read, write, Paxos, membership
• Performance bug: quorum operations call nodes in arbitrary order
  – Should overlap computation when possible
Results: FAB
Distributed block store
• For disk-bound workloads, call self last
• For cache-bound workloads, call self second
  – Actually, last in quorum
Conclusions
• Causal paths are a useful abstraction of distributed system behavior
• Expectations serve as a high-level description
  – Summary of inter-component behavior and timing
  – Regression test for structure and performance
• Finding unexpected behavior can help us find bugs
  – Both structure and performance bugs
  – Visualization can expose additional bugs
http://issg.cs.duke.edu/pip/
Extra slides
• Related work
• Pip results: RanSub
• Pip vs. printf
• Pip resource metrics
• Building a behavior model
• Annotations
• Checking expectations
• Visualization: communication graph
Related work
• Causal paths
  – Pinpoint [Chen, 2004]
  – Magpie [Barham, 2004]
• Black-box inferences
  – Finding stepping stones [Zhang, 1997]
  – Wavelets to debug network performance [Huang, 2001]
• Expectations-checking systems
  – PSpec [Perl, 1993]
  – Meta-level compilation [Engler, 2000]
  – Paradyn [Miller, 1995]
• Model checking
  – MaceMC [Killian, 2006]
  – VeriSoft [Godefroid, 2005]
Pip results: RanSub
2 bugs found, 1 fixed
• Structural bug: during first round of communication, parent nodes send summary messages before hearing from all children
  – Root cause: uninitialized state variables
• Performance bug: linear increase in end-to-end delay for the first ~2 minutes
  – Suspected root cause: data structure listing all discovered nodes
Pip vs. printf
Pip                                                  printf
Nesting, causal order                                Unstructured
Time, path, and thread                               No context
CPU and I/O data                                     No resource information
Automatic verification using declarative language    Verification with ad hoc grep or expect scripts
SQL queries                                          "Queries" using Perl scripts
Automatic generation for some middleware             Manual placement

• Both record interesting events to check off-line
  – Pip imposes structure and automates checking
  – Generalizes ad hoc approaches
Pip resource metrics
• Real time
• User time, system time
  – CPU time = user + system
  – Busy time = CPU time / real time
• Major and minor page faults (paging and allocation)
• Voluntary and involuntary context switches
• Message size and latency
• Number of messages sent
• Causal depth of path
• Number of threads, hosts in path
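The two derived metrics follow directly from the formulas on the slide. A minimal Python sketch, with a hypothetical `TaskUsage` record and sample values in seconds:

```python
from dataclasses import dataclass


@dataclass
class TaskUsage:
    real_time: float    # wall-clock seconds
    user_time: float    # seconds in user mode
    system_time: float  # seconds in kernel mode

    @property
    def cpu_time(self) -> float:
        """CPU time = user + system."""
        return self.user_time + self.system_time

    @property
    def busy_time(self) -> float:
        """Busy time = CPU time / real time (0.0 if no time elapsed)."""
        if self.real_time <= 0:
            return 0.0
        return self.cpu_time / self.real_time


# Hypothetical task: 500 ms of wall-clock time, 100 ms on the CPU.
usage = TaskUsage(real_time=0.500, user_time=0.080, system_time=0.020)
```

A low busy fraction like this one suggests the task spent most of its time waiting (on I/O, locks, or messages) rather than computing.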
Building a behavior model
Model consists of paths constructed from events recorded by the running application
Sources of events:
• Annotations in source code
  – Programmer inserts statements manually
• Annotations in middleware
  – Middleware inserts annotations automatically
  – Faster and less error-prone
• Passive tracing or interposition
  – Easier, but less information
• Or any combination of the above
Annotations
• Set path ID
• Start/end task
• Send/receive message
• Notice
[Figure: timeline of one path. WWW: Parse HTTP, Send response; App srv: Run application; DB: Query; notices: "Request = /cgi/…", "2096 bytes in response", "done with request 12"; horizontal axis: time]
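The four annotation kinds can be illustrated with a toy recorder in Python. Everything here is a hypothetical stand-in: the `Recorder` class, its method names, the event tuple layout, and the sample URL are assumptions for this sketch and do not reproduce Pip's annotation library.

```python
import time
from typing import List, Optional, Tuple

# (timestamp, host, path_id, kind, detail)
Event = Tuple[float, str, Optional[int], str, str]


class Recorder:
    """Hypothetical annotation sink: appends events to an in-memory list."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.path_id: Optional[int] = None
        self.events: List[Event] = []

    def _log(self, kind: str, detail: str) -> None:
        self.events.append((time.time(), self.host, self.path_id, kind, detail))

    def set_path_id(self, path_id: int) -> None:
        self.path_id = path_id

    def start_task(self, name: str) -> None:
        self._log("start_task", name)

    def end_task(self, name: str) -> None:
        self._log("end_task", name)

    def send(self, msg_id: str) -> None:
        self._log("send", msg_id)

    def receive(self, msg_id: str) -> None:
        self._log("receive", msg_id)

    def notice(self, text: str) -> None:
        self._log("notice", text)


def handle_request(rec: Recorder, request_id: int, url: str) -> None:
    """Annotated request handler; the application logic is omitted."""
    rec.set_path_id(request_id)
    rec.start_task("Parse HTTP")
    rec.notice(f"Request = {url}")
    rec.end_task("Parse HTTP")
    rec.send(f"{request_id}-to-app")


www = Recorder("WWW")
handle_request(www, 12, "/example")
```

Every event carries the current path ID, which is what allows events from different hosts to be stitched into one path afterwards.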
Checking expectations
[Figure: checking pipeline]
Application → Traces → Reconciliation → Events database → Path construction → Paths → Expectation checking → Categorized paths
• Reconciliation: match start/end task, send/receive message
• Path construction: organize events into causal paths
• Expectation checking (inputs: Paths and Expectations):
    For each path P
      For each recognizer R
        Does R match P?
    Check each aggregate
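The three stages can be sketched end to end in Python. This is a toy version under several assumptions: the trace format, the function names, and the two recognizers are hypothetical; the "FastParse" limit uses elapsed time rather than CPU time; message reconciliation and aggregate checks are left out.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical trace events: (timestamp, path_id, kind, name)
trace = [
    (0.00, 1, "start_task", "Parse HTTP"),
    (0.01, 1, "end_task", "Parse HTTP"),
    (0.02, 2, "start_task", "Parse HTTP"),
    (0.30, 2, "end_task", "Parse HTTP"),
    (0.31, 2, "notice", "Database error: timeout"),
]

Record = Tuple[str, str, float]  # (kind, name, duration)


def reconcile(events) -> List[Tuple[int, str, str, float]]:
    """Pair each start_task with its end_task; pass other events through.

    Raises KeyError on an end_task with no matching start_task.
    """
    open_tasks, records = {}, []
    for ts, path_id, kind, name in sorted(events):
        if kind == "start_task":
            open_tasks[(path_id, name)] = ts
        elif kind == "end_task":
            start = open_tasks.pop((path_id, name))
            records.append((path_id, "task", name, ts - start))
        else:
            records.append((path_id, kind, name, 0.0))
    return records


def build_paths(records) -> Dict[int, List[Record]]:
    """Group reconciled records by path ID."""
    paths = defaultdict(list)
    for path_id, kind, name, duration in records:
        paths[path_id].append((kind, name, duration))
    return dict(paths)


recognizers: Dict[str, Callable[[List[Record]], bool]] = {
    "FastParse": lambda p: any(
        k == "task" and n == "Parse HTTP" and d <= 0.1 for k, n, d in p),
    "DatabaseError": lambda p: any(
        k == "notice" and n.startswith("Database error") for k, n, d in p),
}


def check(paths: Dict[int, List[Record]]) -> Dict[int, List[str]]:
    """For each path P, list every recognizer R that matches P."""
    return {
        pid: sorted(name for name, r in recognizers.items() if r(p))
        for pid, p in paths.items()
    }


categorized = check(build_paths(reconcile(trace)))
```

Path 1 is categorized under the hypothetical `FastParse` recognizer, and path 2 under `DatabaseError`.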
Visualization: communication graph
• Graph view of all host-to-host network traffic
  – Nodes = hosts, edges = network links in use
  – Directed edge = unidirectional communication
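A communication graph of this kind can be derived from message records by counting (sender, receiver) pairs. The Python sketch below uses hypothetical host names and helper functions; it illustrates the idea and is not the explorer's implementation.

```python
from collections import Counter
from typing import List, Tuple

Edge = Tuple[str, str]  # (sender_host, receiver_host)

# Hypothetical messages, one (sender, receiver) pair per message.
messages: List[Edge] = [
    ("WWW", "App srv"),
    ("App srv", "DB"),
    ("DB", "App srv"),
    ("App srv", "WWW"),
    ("WWW", "App srv"),
    ("WWW", "Logger"),
]


def communication_graph(msgs: List[Edge]) -> Counter:
    """Directed edges weighted by the number of messages sent."""
    return Counter(msgs)


def unidirectional_edges(graph: Counter) -> List[Edge]:
    """Edges with traffic in one direction only."""
    return sorted((a, b) for (a, b) in graph if (b, a) not in graph)


graph = communication_graph(messages)
one_way = unidirectional_edges(graph)
```

In this sample only the `WWW` to `Logger` link is one-way; an unexpected one-way edge, or an edge between hosts that should never talk, is a quick visual cue for a structure bug.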
