SIGCOMM talk

Detailed diagnosis in enterprise networks
Srikanth Kandula, Ratul Mahajan, Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, Victor Bahl
Network diagnosis
Explaining faulty behavior
ratul | sigcomm | '09
Current landscape of
network diagnosis systems
Big enterprises
Large ISPs
Small enterprises
Network size
?
Why study small enterprise networks separately?
Big enterprises
Large ISPs
Small enterprises
Less sophisticated admins
Less rich connectivity
Many shared components
IIS, SQL,
Exchange, …
Our work
1. Shows that small enterprises need “detailed diagnosis”
   • Not enabled by current systems that focus on scale
2. Develops NetMedic for detailed diagnosis
   • Diagnoses application faults without application knowledge
Understanding problems in small enterprises
Symptoms, root causes
100+ cases
And the survey says …

Symptom                            Share
App-specific                       60 %
Failed initialization              13 %
Poor performance                   10 %
Hang or crash                      10 %
Unreachability                      7 %
=> Handle app-specific as well as generic faults

Identified cause                   Share
Non-app config (e.g., firewall)    30 %
Software/driver bug                21 %
App config                         19 %
Overload                            4 %
Hardware fault                      2 %
Unknown                            25 %
=> Identify culprits at a fine granularity

Detailed diagnosis
Example problem 1: Server misconfig
[Figure labels: Browser, Browser, Web server, Server config]
Example problem 2: Buggy client
[Figure labels: SQL client C1, SQL client C2, Requests, SQL server]
Current formulations sacrifice detail (to scale)
Dependency graph based formulations (e.g., Sherlock [SIGCOMM2007])
• Model the network as a dependency graph at a coarse level
• Simple dependency model
Example problem 1: Server misconfig
[Figure labels: Browser, Browser, Web server, Server config]
The network model is too coarse in current formulations
Example problem 2: Buggy client
[Figure labels: SQL client C1, SQL client C2, Requests, SQL server]
The dependency model is too simple in current formulations
A formulation for detailed diagnosis
Dependency graph of fine-grained components
Component state is a multi-dimensional vector (e.g., % CPU time, IO bytes/sec, Connections/sec, 404 errors/sec)
[Figure: dependency graph with nodes SQL client C1, SQL client C2, SQL svr, Exch. svr, IIS svr, IIS config, Process, OS, Config]
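The formulation on this slide (fine-grained components, each with a multi-dimensional state, linked by dependencies) could be represented roughly as follows. This is a minimal sketch, not NetMedic's actual data model: the class names, the per-time-bin dictionaries, and the specific edges are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A fine-grained component (process, OS, config, ...) with a vector-valued state."""
    name: str
    # One dict per time bin, mapping state variable name -> observed value.
    history: list = field(default_factory=list)

    def record(self, **variables):
        self.history.append(dict(variables))


class DependencyGraph:
    """Directed graph; an edge (source, destination) means source may affect destination."""

    def __init__(self):
        self.components = {}
        self.edges = set()

    def add_component(self, name):
        self.components[name] = Component(name)
        return self.components[name]

    def add_dependency(self, source, destination):
        self.edges.add((source, destination))

    def neighbors_affecting(self, name):
        return sorted(s for (s, d) in self.edges if d == name)


# Hypothetical wiring that loosely follows the figure labels.
graph = DependencyGraph()
for name in ["SQL client C1", "SQL client C2", "SQL svr", "OS", "Config"]:
    graph.add_component(name)
for source in ["SQL client C1", "SQL client C2", "OS", "Config"]:
    graph.add_dependency(source, "SQL svr")

# The state is just a bag of numbers; no variable is given special meaning.
graph.components["SQL svr"].record(
    cpu_time_pct=35.0, io_bytes_per_sec=1.2e6, connections_per_sec=40.0
)
```

Keeping the state as untyped name/value pairs matches the stated goal of diagnosing without knowing what any variable means.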
The goal of diagnosis
Identify likely culprits for components of interest
Without using semantics of state variables => No application knowledge
[Figure labels: C1, Svr, C2, Process, OS, Config]
Using joint historical behavior to estimate impact
To estimate the impact of S on D:
1. Identify time periods when state of S was “similar”
2. Measure how “similar” on average states of D are at those times
[Figure: state history of S (s1 … sn, variables a to d) and of D (d1 … dn, variables a to c), with s0 and d0 shown separately]
[Figure: example with C1, Svr, C2; state labels: Request rate (low), Response time (high); Request rate (high), Response time (high); Request rate (high); edge labels: H, H, L]
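The two steps on this slide can be sketched in a few lines. The sketch below is illustrative only: the Euclidean distance, the choice of the k nearest time bins, and the mapping from distance to a score in (0, 1] are all assumptions, and the function names are hypothetical. It also assumes state values are already on comparable scales.

```python
import math


def distance(x, y):
    """Euclidean distance between two equal-length state vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def estimate_impact(s_hist, d_hist, s_now, d_now, k=2):
    """Estimate how strongly S is affecting D, as a value in (0, 1]."""
    # Step 1: find the k time bins in which S's state was most similar to s_now.
    order = sorted(range(len(s_hist)), key=lambda t: distance(s_hist[t], s_now))
    similar_times = order[:k]
    # Step 2: average how far D's state at those times is from d_now.
    avg = sum(distance(d_hist[t], d_now) for t in similar_times) / len(similar_times)
    # Map distance to similarity: 1.0 means D looked exactly as it does now.
    return 1.0 / (1.0 + avg)


# Toy history: S is a client's request rate, D is a server's response time.
s_hist = [[10.0], [12.0], [90.0], [95.0]]
d_hist = [[0.10], [0.10], [0.90], [1.00]]

# The server is slow now. A client sending many requests explains it well ...
busy_client = estimate_impact(s_hist, d_hist, s_now=[92.0], d_now=[0.95])
# ... while a client sending few requests does not.
idle_client = estimate_impact(s_hist, d_hist, s_now=[11.0], d_now=[0.95])
```

In this toy data the busy client gets the higher impact score, because the server's past states during similarly busy periods resemble its current state.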
Robust implementation of impact estimation
• Ignore state variables that represent redundant info
• Place higher weight on state variables likely related to faults being diagnosed
• Ignore state variables irrelevant to interaction with neighbor
• Account for aggregate relationships among state variables of neighboring components
• Account for disparate ranges of state variables
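Two of the points above (disparate ranges, and weighting or ignoring variables) can be illustrated with a small sketch. Min-max scaling and a weighted mean of absolute differences are assumptions chosen for simplicity; the slides do not say which normalization or weighting scheme is used, and the function names are hypothetical.

```python
def normalize(history):
    """Return a function that rescales each variable to [0, 1] over the history's range."""
    columns = list(zip(*history))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]

    def scale(vector):
        # A variable that never changed in the history carries no signal: map it to 0.
        return [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(vector, lows, highs)
        ]

    return scale


def weighted_difference(x, y, weights):
    """Weighted mean of per-variable absolute differences; weight 0 ignores a variable."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * abs(a - b) for a, b, w in zip(x, y, weights)) / total


# Two variables with very different ranges: % CPU time and IO bytes/sec.
history = [[0.0, 1000.0], [50.0, 3000.0], [100.0, 5000.0]]
scale = normalize(history)
a = scale([50.0, 1000.0])
b = scale([50.0, 5000.0])

equal_weights = weighted_difference(a, b, [1, 1])  # both variables count
io_ignored = weighted_difference(a, b, [1, 0])     # IO treated as irrelevant
io_emphasized = weighted_difference(a, b, [1, 3])  # IO weighted higher
```

Without the rescaling step, the IO variable would dominate any distance simply because its raw values are thousands of times larger.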
Implementation of NetMedic
Monitor components -> Component states -> Diagnose (a. edge impact, b. path impact) -> Ranked list of likely culprits
Inputs to Diagnose: Target components, Diagnosis time, Reference time
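The Diagnose stage above goes from per-edge impact to path impact to a ranked list. The slide does not state how edge impacts are combined along a path, so the sketch below assumes one plausible rule: score each candidate by the best geometric mean of edge impacts over simple paths to the target. The function names and edge values are hypothetical.

```python
import math


def path_impact(edges, source, target):
    """Best geometric-mean edge impact over simple paths from source to target."""
    best = 0.0

    def walk(node, weights, seen):
        nonlocal best
        if node == target:
            if weights:
                best = max(best, math.prod(weights) ** (1.0 / len(weights)))
            return
        for (src, dst), w in edges.items():
            if src == node and dst not in seen:
                walk(dst, weights + [w], seen | {dst})

    walk(source, [], {source})
    return best


def rank_culprits(edges, target):
    """Order all other components by decreasing path impact on the target."""
    components = {n for edge in edges for n in edge} - {target}
    scored = [(path_impact(edges, c, target), c) for c in components]
    return [c for score, c in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical edge impacts for the buggy-client example; C2 is the affected component.
edges = {
    ("C1", "Svr"): 0.9,
    ("Svr", "C2"): 0.8,
    ("Config", "Svr"): 0.2,
}
ranking = rank_culprits(edges, "C2")
```

With these numbers the client C1 ranks first, ahead of the server it overloads, which is the outcome the buggy-client example calls for. Exhaustive path enumeration is only practical for small graphs.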
Evaluation setup
IIS, SQL, Exchange, …
10 actively used desktops
#components: ~1000
#dimensions per component (avg): 35
Diverse set of faults observed in the logs
NetMedic assigns low ranks to actual culprits
[Plot: Cumulative % of faults (0 to 100) vs. Rank of actual culprit (0 to 100); curves: NetMedic, Coarse]
NetMedic handles concurrent faults well
[Plot: Cumulative % of faults (0 to 100) vs. Rank of actual culprit (0 to 100), 2 simultaneous faults; curves: NetMedic, Coarse]
Other results in the paper
NetMedic needs a modest amount (~60 mins) of history
It compares favorably with a method that understands
variable semantics
Conclusions
NetMedic enables detailed diagnosis in enterprise
networks w/o application knowledge
Think small: Small enterprise networks deserve
more attention