Fault Detection, Isolation,
and Diagnosis In Multihop
Wireless Networks
Lili Qiu, Paramvir Bahl, Ananth Rao, and Lidong Zhou
Microsoft Research
Presented by Maitreya Natu
Network Management
[Diagram: a faulty network is analyzed against a faults directory to find the root cause; corrective measures restore a healthy network]
Tasks involved in Network Management
Continuously monitoring the functioning of the network
Collecting information about the nodes and the links
Removing inconsistencies and noise from the reported information
Analyzing the information
Taking appropriate actions to improve network reliability and performance
Challenges in wireless networks
Dynamic and unpredictable topology
Link errors due to fluctuating environmental conditions
Node mobility
Limited capacity
Scarcity of resources
Link attacks
Proposed framework
Reproduce, inside a simulator, the real-world events that took place in the network
Use online trace-driven simulation to detect faults and analyze their root causes
Network Management
[Roadmap diagram: healthy network, faults directory → network model, types of faults]
Creating a network model
Network Management
[Roadmap diagram: faulty network, faults directory → network model, types of faults → detected faults]
Fault diagnosis
Network Management
[Roadmap diagram: network model, types of faults, faults directory, detected faults → corrective measures]
What-if analysis
Key issues
How to accurately reproduce, inside a simulator, what happened in the network
How to build fault diagnosis on top of a simulator to perform root cause analysis
Accurate modeling
Use real traces from the diagnosed network
Removes dependency on generic theoretical models
Captures nuances of the hardware, software, and environment of the particular network
Collect good-quality data
By developing a technique to effectively rule out erroneous data
Fault diagnosis
Performance data emitted by trace-driven simulation is used as a baseline
Any significant deviation indicates a potential fault
The simulator selectively injects a set of suspected faults and searches for the set that best reproduces the observed performance
An efficient algorithm is designed to determine root causes
System Overview
1. Receive cleaned data (link RSS, link load, topology changes, loss rate, throughput)
2. Drive the simulation (traffic simulator, routing-update simulator, interference injection)
3. Compute expected performance (expected loss rate and throughput, +/- noise and error margins)
4. Compare expected and observed performance
5. Discrepancy found
6. Search the faults directory for the set of faults (e.g., link/node failure) that results in the best explanation
7. Report the cause of failure
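The seven steps above can be sketched as a single loop. All names here (`run_simulation`, `faults_directory`, and so on) are illustrative stand-ins, not the paper's actual implementation:

```python
def diagnose(cleaned_traces, observed_perf, faults_directory, run_simulation,
             threshold=0.1):
    """Return the fault set that best explains the observed performance."""
    # Steps 2-3: drive the simulator with the cleaned traces to get a baseline.
    expected_perf = run_simulation(cleaned_traces, faults=[])

    # Steps 4-5: compare expected and observed performance per link.
    discrepancy = {link: abs(observed_perf[link] - expected_perf[link])
                   for link in observed_perf}
    if max(discrepancy.values(), default=0.0) <= threshold:
        return []  # no significant deviation: report no fault

    # Step 6: search the faults directory for the candidate set whose
    # injected simulation best reproduces the observed performance.
    best_set, best_error = [], float("inf")
    for candidate in faults_directory:
        simulated = run_simulation(cleaned_traces, faults=candidate)
        error = sum(abs(observed_perf[l] - simulated[l]) for l in observed_perf)
        if error < best_error:
            best_set, best_error = candidate, error

    # Step 7: report the most plausible cause of failure.
    return best_set
```

In the real system the search is guided by which metrics deviate, rather than enumerating the whole directory as this sketch does.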
Why Simulation-Based Diagnosis?
Much better insights into network behavior than any heuristic or theoretical technique
Highly customizable and applies to a large class of networks
Ability to perform what-if analysis
Helps to foresee the consequences of a corrective action
Recent advances in simulators have made their use for real-time analysis possible
Accurate modeling
[Roadmap diagram: healthy network, faults directory → network model, types of faults]
Current network models
Bayesian networks to map symptom-fault dependencies
Context-free grammars
Correlation matrices
Can on-line simulations be used as a core tool?
Building confidence in simulator accuracy
Problem
Hard to accurately model the physical layer and RF propagation
Traffic demands on the router are hard to predict
Solution
“After the fact” simulation
Agents periodically report information about the link conditions and traffic patterns to the link simulators
Simulations when the RF condition of the link is good
Modeling the contention from flows within the interference and communication ranges
Modeling the overheads of the protocol stack, such as parity bits, MAC-layer back-off, IEEE 802.11 inter-frame spacing and ACKs, and headers
Simulations with varying received signal strength
The simulator's estimate deviates from reality when signal strength is poor
Measured throughput matches the simulator's estimate closely when signal quality is good
Why do simulation results deviate when signal strength is poor?
Lack of an accurate packet-loss model as a function of packet size, RSS, and ambient noise
Depends on the signal-processing hardware and the RF antenna within the wireless cards
Lack of accurate auto-rate control modeling
Adjustment of the sending rate done by WLAN cards based on the transmission conditions
How to model auto-rate control done by WLAN cards?
Use trace-driven simulation
When auto-rate is in use: collect the rate at which the wireless card is operating and provide the reported rate to the simulator
Otherwise: the data rate is known to the simulator
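A minimal sketch of this rule, assuming a trace format with per-link reported rates (the field names are made up):

```python
def data_rate_for(link, trace_sample, autorate_enabled, fixed_rate_mbps=54.0):
    """Pick the data rate the simulator should use for a link."""
    if autorate_enabled:
        # When auto-rate is in use, use the rate the wireless card
        # actually reported for this link in the collected trace.
        return trace_sample["reported_rate_mbps"][link]
    # Otherwise the configured data rate is already known to the simulator.
    return fixed_rate_mbps
```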
How to model packet loss accurately as a function of packet size, RSS, and ambient noise?
Use offline analysis
Calibrate the wireless cards and create a database associating environmental factors with expected performance
E.g., a mapping from signal strength and noise to loss rate
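One way such a calibration database could look, with made-up bin boundaries and loss values standing in for real per-card measurements:

```python
import bisect

class LossCalibration:
    """Lookup table from measured signal strength and noise to loss rate."""

    def __init__(self):
        # RSS bin lower bounds (dBm) -> loss rate at a reference noise floor;
        # these numbers are illustrative, not measured data.
        self.rss_bins = [-90, -80, -70, -60]
        self.loss_rates = [0.60, 0.30, 0.10, 0.01]

    def loss_rate(self, rss_dbm, noise_dbm):
        # Penalize the effective signal as ambient noise rises above an
        # assumed -95 dBm floor; this SNR-style adjustment is a simplification.
        effective = rss_dbm - max(0, noise_dbm + 95)
        i = bisect.bisect_right(self.rss_bins, effective) - 1
        return self.loss_rates[max(i, 0)]
```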
Experiment to model the loss rates due to poor signal strength
Collect another set of traces
Slowly send out packets
Place packet sniffers near both the sender and the receiver, and derive the loss rate from the packet-level trace
Seed the wireless link in the simulator with a Bernoulli loss rate that matches the loss rate in the real traces
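The two steps — deriving the loss rate from the sniffer traces and seeding the simulated link with a matching Bernoulli process — might look like this sketch, where packet logs are modeled as lists of sequence numbers:

```python
import random

def measured_loss_rate(sender_log, receiver_log):
    """Derive the loss rate from packet-level traces at sender and receiver."""
    sent = set(sender_log)
    received = set(receiver_log)
    return 1.0 - len(sent & received) / len(sent)

def bernoulli_link(loss_rate, rng=None):
    """Return a per-packet drop decision function for the simulated link."""
    rng = rng or random.Random()
    return lambda: rng.random() < loss_rate
```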
Estimated and measured throughput when compensating for the loss rate due to poor signal strength
Loss rate and measured throughput do not decrease monotonically with signal strength, due to the effect of auto-rate
Even though the match is not perfect, this is not expected to be a problem, because:
many routing protocols try to avoid the use of poor-quality links
poor-quality links are used only when certain parts of the mesh network have poor connectivity to the rest of the network
in a well-engineered network, not many nodes depend on such bad links for routing
Stability of channel conditions
How rapidly do channel conditions change, and how often should a trace be collected?
Temporal fluctuation in RSS:
Fluctuation magnitude is not significant
The relative quality of signals across different numbers of walls remains stable
Stability of channel conditions
When the environment is generally static, nodes may report only the average and standard deviation of the RSS to the manager every few minutes
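Such a report could be as compact as this sketch, which reduces a window of raw RSS samples to two numbers before sending them to the manager:

```python
import statistics

def rss_summary(rss_samples_dbm):
    """Compress raw RSS samples into the (mean, stddev) report."""
    return {
        "mean_dbm": statistics.mean(rss_samples_dbm),
        "stddev_db": statistics.pstdev(rss_samples_dbm),
    }
```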
Dealing with imperfect data
By neighborhood monitoring:
Each node reports performance and traffic statistics for its incoming and outgoing links, and for other links in its communication range (possible when the node is in promiscuous mode)
Thus multiple reports are sent for each link
Redundant reports can be used to detect inconsistency
Find the minimum set of nodes that can explain the inconsistency in the reports
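Finding a small node set that explains all conflicting reports can be approximated greedily; the set-cover framing below is an assumed heuristic, not necessarily the paper's exact formulation:

```python
def suspect_nodes(inconsistent_reports):
    """inconsistent_reports: list of sets of nodes involved in each
    conflicting report. Returns a small node set covering all conflicts."""
    uncovered = [set(r) for r in inconsistent_reports]
    suspects = set()
    while uncovered:
        # Pick the node appearing in the most still-unexplained conflicts.
        counts = {}
        for report in uncovered:
            for node in report:
                counts[node] = counts.get(node, 0) + 1
        best = max(counts, key=counts.get)
        suspects.add(best)
        # Conflicts involving the chosen node are now explained.
        uncovered = [r for r in uncovered if best not in r]
    return suspects
```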
Summary
How to accurately model the real behavior?
Solution: use trace-based simulation
Problem: simulation results are good for strong signals but deviate under bad RF conditions
Need to model the auto-rate control
Need to model the loss rate due to poor signal strength: use offline analysis
How often should a trace be collected?
Use trace-driven data: very little data (average and standard deviation of RSS) at fairly low time granularity, as channels are relatively stable
How to deal with imperfect data?
By neighborhood monitoring
Fault diagnosis
[Roadmap diagram: faulty network, faults directory → network model, types of faults → detected faults]
Current fault diagnosis approaches
AI techniques: rule-based systems, neural networks
Model-traversing techniques: dependency graphs, causality graphs, Bayesian networks
Fault Isolation and Diagnosis
Establish the expected performance in the simulation
Find the difference between expected and observed performance
Search over the fault space to detect which set of faults can reproduce performance similar to what has been observed
Collecting data from traces
Trace data collection:
Network topology: each node reports its neighbor and routing tables
Traffic statistics: each node maintains counters of traffic sent to and received from immediate neighbors
Physical medium: each node reports the signal strength of the wireless links to its neighbors
Network performance: includes both link and end-to-end performance, which can be measured through loss rate, delay, and throughput; the focus is on link-level performance
Simulating the network performance
Traffic load simulation: link-based traffic simulation; adjust the application sending rate to match the observed link-level traffic counts
Route simulation: use the actual routes taken by packets as input to the simulator
Wireless signal: use real measurements of signal strength
Fault injection
Random packet dropping
External noise sources
MAC misbehavior
Fault diagnosis algorithm
General approach:
[Diagram: network settings → simulator → expected performance; network settings + fault set → simulator → observed performance. How to find the fault set?]
How to search the fault space efficiently?
Different types of faults often change only one or a few metrics
E.g., random dropping affects only the link loss rate
Thus, use the metrics in which observed and expected performance differ significantly to guide the search
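A sketch of using the deviating metrics to prune the candidate fault types before searching, with the fault-to-metric mapping taken from the symptoms listed elsewhere in the talk (the dictionary and function names are illustrative):

```python
# Each fault type is associated with the metric(s) it visibly changes,
# per the slides: random dropping -> loss rate, external noise -> noise
# level, MAC misbehavior -> throughput.
FAULT_METRICS = {
    "random_dropping": {"loss_rate"},
    "external_noise": {"noise"},
    "mac_misbehavior": {"throughput"},
}

def candidate_faults(expected, observed, threshold=0.1):
    """Return fault types consistent with the significantly deviating metrics."""
    deviating = {m for m in observed
                 if abs(observed[m] - expected[m]) > threshold}
    return {fault for fault, metrics in FAULT_METRICS.items()
            if metrics & deviating}
```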
Scenario where faults do not have strong interactions
Consider a large deviation from the expected performance as an anomaly
Use a decision tree to determine the type of fault
The fault type determines the metric used to quantify the performance difference
Locate faults by finding the set of nodes and links with a large difference between expected and observed performance
Scenario where faults have strong interactions
Get the initial diagnosis set from the decision-tree algorithm
Iteratively refine the fault set:
Adjust the magnitudes of faults in the fault set; translate the difference in performance into a change in the faults' magnitudes (mapping the impact of a fault back to its magnitude)
Remove faults whose magnitude is too small
Add new faults that can explain large differences between the expected and observed performance
Iterate until the change in the fault set is negligible
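The refinement loop might be sketched as follows. Two caveats: the step that adds brand-new faults is omitted, and the simple proportional update standing in for the paper's impact-to-magnitude translation is an assumption:

```python
def refine(fault_set, simulate, observed, min_magnitude=0.01,
           step=0.5, max_iters=50):
    """fault_set: dict fault_id -> magnitude. Returns the refined set."""
    for _ in range(max_iters):
        expected = simulate(fault_set)
        changed = False
        for fault in list(fault_set):
            # Translate the per-fault performance gap into a magnitude change.
            gap = observed[fault] - expected[fault]
            if abs(gap) > min_magnitude:
                fault_set[fault] += step * gap
                changed = True
            # Remove faults whose magnitude has become negligible.
            if fault_set[fault] < min_magnitude:
                del fault_set[fault]
                changed = True
        if not changed:  # change in the fault set is negligible: stop
            break
    return fault_set
```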
Example scenario
[Figure: five-node wireless topology, nodes 1-5]
Example scenario
Observed performance:
• Increased loss rate at links 1-4 and 1-2
• No increase in the sending rate of 1-4 and 1-2
• No increase in noise experienced by neighbors
Inference (decision tree):
Increased sending rate? Y → too low CW; N ↓
Increased noise? Y → noise; N ↓
Increased loss? Y → packet drop; N → normal
Example scenario (continued)
Walking the decision tree with these observations: no increased sending rate, no increased noise, but increased loss → packet drop
Inference: packet dropping at node 1
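The decision tree in the example reduces to three ordered symptom checks, which can be written directly as code:

```python
def classify(symptoms):
    """symptoms: dict with booleans for the three observed anomalies."""
    if symptoms["increased_sending_rate"]:
        return "too low CW"    # contention window set too small
    if symptoms["increased_noise"]:
        return "noise"         # external noise source
    if symptoms["increased_loss"]:
        return "packet drop"   # loss despite normal rate and noise
    return "normal"

# The example's observations (higher loss on 1-4 and 1-2, no increase in
# sending rate or neighbor noise) fall through to the "packet drop" branch,
# pointing at packet dropping at node 1.
```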
Accuracy of fault diagnosis
Correctness of the model:
Complete information
Consistent information
Timely information
Correctness of the reported symptoms:
Right size of the threshold to report a symptom
Difference in the behavior of faults
Timely reporting of symptoms
System implementation
Windows XP
Agents run on every wireless node and report collected information on demand
Managers collect and analyze the information
Collected information is cast into performance counters supported by Windows
The manager is connected to a backend simulator; collected information is converted into a script that drives the simulation
Testbed:
Multihop wireless testbed built using IEEE 802.11a cards
A commercially available network sniffer called AiroPeek is used for data collection
Native 802.11 NICs provide a rich set of networking information
Evaluation: Data collection overhead
Overhead < 800 bits/s/node
Management traffic overhead: data collection traffic has little effect on the performance of an FTP flow (measured with and without data collection)
No data cleaning: each link is reported only once
With data cleaning: each link is reported by all observers for consistency checking
Data cleaning effectiveness
Coverage greater than 80% in all cases
Higher accuracy with grid topology
Higher coverage when using history
Higher accuracy with denser networks
Higher accuracy with client-server traffic
Evaluation: Fault diagnosis
Detecting random dropping
• Symptom: significant difference in loss rates on links
• Less than 20% of faulty links are left undetected
• No-effect faults are faulty links sending less than the threshold (250 packets) of data
Detecting external noise
• Symptom: significant difference in noise level at nodes
• Noise sources are correctly identified with at most one or two false positives
• Inference error in the magnitudes of noise is within 4%
Evaluation: Fault diagnosis (continued)
Detecting MAC misbehavior
• Symptom: significant discrepancy in throughput on links
• Coverage is mostly around 80% or higher
• False positives are within 2
Detecting combinations of all of the above fault types
What-if analysis
[Roadmap diagram: network model, types of faults, faults directory, detected faults → corrective measures]
What-if analysis
Topology → diagnosis → corrective measures
Limitations
Limited by the accuracy of the simulator
The time to detect faults is acceptable for long-term faults but not for transient faults
The choice of traces used to drive the simulation has important implications
The focus has been only on faults that result in different observable behavior
Conclusion
Trace data is used for modeling the network
Data collection techniques are presented to collect network information and detect deviations from the expected performance
A fault diagnosis algorithm is proposed to detect the root causes of failures
A scheme for what-if analysis is proposed to evaluate alternative network configurations for efficient network operation
Future work
Validation on a large test-bed
Performance analysis in presence of mobility
Detecting malicious attacks
Diagnosis in presence of incomplete network
information
More deeply investigating the potential of what-if
analysis
References
L. Qiu, P. Bahl, A. Rao, and L. Zhou, "Fault Detection, Isolation, and Diagnosis in Multihop Wireless Networks," Microsoft Research TR-2004-11, Dec. 2003.
M. Steinder and A. Sethi, "A Survey of Fault Localization Techniques in Computer Networks," Technical Report, CIS Dept., University of Delaware, Feb. 2001.
M. Steinder, "Probabilistic Inference for Diagnosing Service Failures in Communication Systems," PhD thesis, University of Delaware, 2003.
Questions
What is the proposed solution to model the throughput when the signal strength is poor? In Table 2, the simulated throughput monotonically decreases with the loss rate while the measured throughput does not. Why?
What could cause false positives in the fault diagnosis results? When can the false-positive ratio increase?
http://www.cis.udel.edu/~natu/861/861.html