Project5: Performance debugging for distributed systems of

Performance Debugging for Distributed Systems of Black Boxes
Marcos K. Aguilera
Jeffrey C. Mogul
Janet L. Wiener
HP Labs
Patrick Reynolds, Duke
Athicha Muthitacharoen, MIT
WISP 2004
11 November 2004
Example multi-tier system
[Figure: diagram of clients, web servers, application servers, an authentication server, and database servers]
20 October 2003
Project5 - SOSP
page 2
Motivation
• Complex distributed systems are built from black-box components
• These systems may have performance problems
• High or erratic latency
• Locating the causes of these problems is hard
• We can’t always examine or modify system components
• We need tools to infer where bottlenecks are
• Choose which black boxes to open
Contributions of our work
• Tools to highlight which black boxes have problems
• Require only passive information, such as packet traces
• Infer from traces where most of the time is spent
• A person can then use more invasive tools to examine those boxes
• Reduce time and cost to debug complex systems
• Improve quality of delivered systems
Example causal path
[Figure: the example multi-tier system diagram (clients, web servers, application servers, authentication server, database servers), annotated with a 100ms delay]
Goals of our tools
• Find high-impact causal paths through a distributed system
  Causal path: a series of nodes that sent/received messages
  – Each message is caused by receipt of the previous message
  – Some causal paths occur many times
  High impact:
  – Occurs frequently
  – Contributes significantly to overall latency
• Without modifications or semantic knowledge
• Report per-node latencies on causal paths
Overview of our approach
• Obtain traces of messages between components
  – Ethernet packets, middleware messages, etc.
  – Collect traces as non-invasively as possible
• Analyze traces using algorithms
• Visualize results and highlight high-impact paths
• Requires very little information:
  [timestamp, source, destination]
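
As a minimal sketch of how little input this requires, the record type below models one trace entry. The `TraceRecord` name, the `parse_trace` helper, and the whitespace-separated text format are assumptions made for illustration; they are not taken from the original tools.

```python
from typing import List, NamedTuple


class TraceRecord(NamedTuple):
    """One observed message: when it was seen, who sent it, who received it."""
    timestamp: float  # seconds (the unit is an assumption)
    source: str
    destination: str


def parse_trace(lines: List[str]) -> List[TraceRecord]:
    """Parse lines of the (assumed) form 'timestamp source destination'."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        records.append(TraceRecord(float(fields[0]), fields[1], fields[2]))
    # Sort by time so later analysis can assume chronological order.
    return sorted(records, key=lambda r: r.timestamp)


trace = parse_trace(["0.02 A B", "0.01 A B", "0.04 B D", "0.05 C F"])
```

Nothing in the record describes message contents or request identifiers, which is what lets the traces be gathered passively.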
Outline
• Problem statement & goals
• Overview of our approach
• Algorithm
• Experimental results
• Related work
• Conclusions
The convolution algorithm: input
Time   From   To
0.01   A      B
0.02   A      B
0.04   B      D
0.05   C      F
...
The convolution algorithm: output
[Figure: causal-path graph over nodes A through G, with edges labeled by delay (.15, .10, or 0)]
Basic idea
• Creates a “time signal” for messages from each node
  S1(t) = (A→B msgs)
  [Figure: A→B messages marked on a time axis with ticks 1 through 7]
• Given time signals S1(t) = (A→B) and S2(t) = (B→X)
  Computes the convolution of S2(t) and S1(–t), written S1 * S2
  (can be computed quickly using fast Fourier transforms)
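
The idea can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the helper names `time_signal` and `cross_correlate`, the 10 ms bin width, and the sample timestamps are all assumptions, and the correlation is computed directly rather than with FFTs so that the example needs no external libraries.

```python
from typing import List


def time_signal(timestamps_ms: List[int], bin_ms: int, n_bins: int) -> List[int]:
    """Count messages per time bin; bin i covers [i*bin_ms, (i+1)*bin_ms)."""
    signal = [0] * n_bins
    for t in timestamps_ms:
        i = t // bin_ms
        if 0 <= i < n_bins:
            signal[i] += 1
    return signal


def cross_correlate(s1: List[int], s2: List[int]) -> List[int]:
    """Return C[d] = sum over t of s1[t] * s2[t + d], for shifts d >= 0.

    This equals the convolution of s2(t) with s1(-t), restricted to
    non-negative shifts (a message cannot be caused by a later one).
    """
    n = min(len(s1), len(s2))
    return [sum(s1[t] * s2[t + d] for t in range(n - d)) for d in range(n)]


BIN_MS = 10   # assumed bin width
N_BINS = 32   # assumed trace length in bins

# Hypothetical data: every A->B message triggers a B->X message 30 ms later,
# and there is one unrelated B->X message at 175 ms.
a_to_b = [10, 50, 120, 200, 260]
b_to_x = [t + 30 for t in a_to_b] + [175]

s1 = time_signal(a_to_b, BIN_MS, N_BINS)
s2 = time_signal(b_to_x, BIN_MS, N_BINS)
corr = cross_correlate(s1, s2)

# The shift with the largest correlation is the candidate characteristic delay.
best_shift = max(range(len(corr)), key=lambda d: corr[d])
delay_ms = best_shift * BIN_MS
```

With this sample data the largest value sits at a shift of 3 bins, matching the 30 ms delay built into the data. The direct sum costs quadratic time in the number of bins, which is why an FFT-based computation is attractive for long traces.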
[Figure: plots of S1(t) = (A→B msgs), S2(t) = (B→X msgs), and S1 * S2 = conv(S2(t), S1(–t))]
• Spikes suggest causality between A→B and B→X msgs
• The time shift of a spike indicates its characteristic delay
Details: first step
• Choose starting node A
• Use trace to add edges from it

Time   From   To
0.01   A      B
0.02   A      B
0.04   A      C
0.05   A      B

[Figure: graph with edges from A to B and from A to C]
Continuing
Time   From   To
…      B      D
…      B      E
…      B      F
…      B      G

[Figure: graph of A, B, and C with the next hop out of B marked “??”; the convolutions (A→B)*(B→D) and (A→B)*(B→E) are shown, with a delay labeled d]
How
Time   From   To
t1     A      B
t2     A      B
t3     A      B
t4     A      B

Time   From   To
…
t1+d   B      D
…
t2+d   B      D
…
t3+d   B      D
t3+d   B      E
…
t4+d   B      D
Heuristic to find spikes
threshold 1: n1 standard deviations above the mean
threshold 2: n2 standard deviations above the mean
n1 = 2
n2 = 1.5
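
One way to read this heuristic is as a threshold test on the correlation values. The sketch below flags every shift whose value exceeds the mean by more than n standard deviations. The function name `find_spikes` and the sample values are hypothetical, the use of the population standard deviation is an assumption, and because the slide does not say how the two thresholds are combined, the sketch simply takes a single n from the caller.

```python
from statistics import mean, pstdev
from typing import List


def find_spikes(values: List[float], n: float) -> List[int]:
    """Return the indices whose value is more than n stddev above the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population stddev (an assumption)
    if sigma == 0:
        return []  # a flat signal has no spikes
    threshold = mu + n * sigma
    return [i for i, v in enumerate(values) if v > threshold]


# Hypothetical correlation output with one clear spike at shift 3.
corr = [0, 1, 0, 9, 1, 0, 0, 1, 0, 0]
spikes_strict = find_spikes(corr, n=2)    # threshold 1, n1 = 2
spikes_loose = find_spikes(corr, n=1.5)   # threshold 2, n2 = 1.5
```

Each returned index is a shift in bins; multiplying by the bin width gives the candidate delay for that spike.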
Recursing to continue
[Figure: nodes A, B, and D, with delay d and an unknown next hop marked “??”]
• Observations:
  1. (B→D) are not all msgs from B to D
     (only those caused by A)
  2. Stop recursion when too few messages are left
     or no more spikes are found
Outline
• Problem statement & goals
• Overview of our approach
• Algorithm
• Experimental results
• Conclusions
Results: email service delays
• Jeff logged all email headers for two months
• Parsed 80K Received headers in 12K messages
    Received: from cceexg11.americas.cpqcorp.net ...
      by wera.hpl.hp.com ... ; Fri, 4 Apr 2003 15:35:54 -0800
  – Yields (timestamp, sender, receiver) trace records
• Used the convolution algorithm to
  – Reconstruct message paths
  – Find typical delays
• Note: this is NOT the most direct way to use email headers
  – We made the problem harder so as to test our algorithm
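
Turning a Received header into a (timestamp, sender, receiver) record might look like the sketch below. It is not the parser used for these results: the `HopRecord` type, the `parse_received` helper, and the regular expression are assumptions, and because the header shown on the slide is elided, the sample header uses hypothetical `example.com` hosts.

```python
import re
from email.utils import parsedate_to_datetime
from typing import NamedTuple, Optional


class HopRecord(NamedTuple):
    timestamp: float  # seconds since the Unix epoch
    sender: str
    receiver: str


# Matches 'from <host> ... by <host> ... ; <date>' in an unfolded header value.
_RECEIVED = re.compile(r"from\s+(\S+).*?\bby\s+(\S+).*;\s*(.+)$")


def parse_received(value: str) -> Optional[HopRecord]:
    """Extract one hop from the value of a Received header, or None on failure."""
    unfolded = " ".join(value.split())  # undo header line folding
    match = _RECEIVED.search(unfolded)
    if match is None:
        return None
    sender, receiver, date_text = match.groups()
    try:
        when = parsedate_to_datetime(date_text)
    except (TypeError, ValueError):
        return None  # unparseable date
    return HopRecord(when.timestamp(), sender, receiver)


# Hypothetical header value (hosts and id are made up for illustration).
example = (
    "from mail-a.example.com (mail-a.example.com [192.0.2.1])\n"
    "    by mail-b.example.com with ESMTP id abc123;\n"
    "    Fri, 4 Apr 2003 15:35:54 -0800"
)
record = parse_received(example)
```

Each hop stamps the header with its own clock, so delays computed from such records also include any clock skew between the hosts.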
Email trace: output
[Figure: convolution algorithm output for the email trace, a graph with numeric node and edge annotations]
Results: Petstore
• Sun’s demo application for J2EE
• Stanford’s PinPoint project provided us with traces
  – One trace has a node that is artificially slowed down
Future work
• Automate trace gathering and conversion
• Sliding-window versions of algorithms
  – Find phased behavior
  – Reduce memory usage of nesting algorithm
  – Improve speed of convolution algorithm
• Validate usefulness on more complicated systems
• What are the limits of our approach?
Conclusion
• Looking for bottlenecks in black-box systems
• Use signal processing techniques to find causal paths in the network and their delays
• For more information
  – http://www.hpl.hp.com/research/project5/
• Contact us if you have multi-hop message traces!