Fault Detection, Isolation,
and Diagnosis In Multihop
Wireless Networks
Lili Qiu, Paramvir Bahl, Ananth Rao, and Lidong Zhou
Microsoft Research
Presented by Maitreya Natu
Network Management
[Diagram: a faulty network is analyzed against a faults directory to find the root cause; corrective measures restore a healthy network]
Tasks involved in Network Management
Continuously monitoring the functioning of the network
Collecting information about the nodes and the links
Removing inconsistencies and noise from the reported information
Analyzing the information
Taking appropriate actions to improve network reliability and performance
Challenges in wireless networks
Dynamic and unpredictable topology
Link errors due to fluctuating environmental conditions
Node mobility
Limited capacity
Scarcity of resources
Link attacks
Proposed framework
Reproduce, inside a simulator, the real-world events that took place in the network
Use online trace-driven simulation to detect faults and analyze their root causes
Network Management
[Roadmap diagram: healthy network, faults directory → network model, types of faults]
Creating a network model
Network Management
[Roadmap diagram: faulty network, faults directory → network model, types of faults → detected faults]
Fault diagnosis
Network Management
[Roadmap diagram: network model, types of faults, faults directory, detected faults → corrective measures]
What-if analysis
Key issues
How to accurately reproduce, inside a simulator, what happened in the network
How to build fault diagnosis on top of a simulator to perform root cause analysis
Accurate modeling
Use real traces from the diagnosed network
Removes dependency on generic theoretical models
Captures nuances of the hardware, software, and environment of the particular network
Collect good-quality data
By developing a technique to effectively rule out erroneous data
Fault diagnosis
Performance data emitted by trace-driven simulation is used as a baseline
Any significant deviation indicates a potential fault
The simulator selectively injects a set of suspected faults and searches for the set that best reproduces the observed performance
An efficient algorithm is designed to determine root causes
System Overview
1. Receive cleaned data (link RSS, link load, topology changes, loss rate, throughput)
2. Drive the simulation (traffic simulator, routing-update simulator, interference injection)
3. Compute expected performance (expected loss rate and throughput, +/- noise and error margins)
4. Compare expected and observed performance
5. Discrepancy found
6. Search the faults directory for the set of faults (e.g., link/node failure) that results in the best explanation
7. Report the cause of failure
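The seven steps above can be sketched as a single loop. All names here (`run_simulation`, `faults_directory`, and so on) are illustrative stand-ins, not the paper's actual implementation:

```python
def diagnose(cleaned_traces, observed_perf, faults_directory, run_simulation,
             threshold=0.1):
    """Return the fault set that best explains the observed performance."""
    # Steps 2-3: drive the simulator with the cleaned traces to get a baseline.
    expected_perf = run_simulation(cleaned_traces, faults=[])

    # Steps 4-5: compare expected and observed performance per link.
    discrepancy = {link: abs(observed_perf[link] - expected_perf[link])
                   for link in observed_perf}
    if max(discrepancy.values(), default=0.0) <= threshold:
        return []  # no significant deviation: report no fault

    # Step 6: search the faults directory for the candidate set whose
    # injected simulation best reproduces the observed performance.
    best_set, best_error = [], float("inf")
    for candidate in faults_directory:
        simulated = run_simulation(cleaned_traces, faults=candidate)
        error = sum(abs(observed_perf[l] - simulated[l]) for l in observed_perf)
        if error < best_error:
            best_set, best_error = candidate, error

    # Step 7: report the most plausible cause of failure.
    return best_set
```

In the real system the search is guided by which metrics deviate, rather than enumerating the whole directory as this sketch does.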
Why Simulation-Based Diagnosis?
Much better insights into network behavior than any heuristic or theoretical technique
Highly customizable and applies to a large class of networks
Ability to perform what-if analysis
Helps to foresee the consequences of a corrective action
Recent advances in simulators have made their use for real-time analysis possible
Accurate modeling
[Roadmap diagram: healthy network, faults directory → network model, types of faults]
Current network models
Bayesian networks to map symptom-fault dependencies
Context-free grammars
Correlation matrices
Can on-line simulations be used as a core tool?
Building confidence in simulator accuracy
Problem
Hard to accurately model the physical layer and RF propagation
Traffic demands on the router are hard to predict
Solution
“After the fact” simulation
Agents periodically report information about the link conditions and traffic patterns to the link simulators
Simulations when the RF condition of the link is good
Modeling the contention from flows within the interference and communication ranges
Modeling the overheads of the protocol stack, such as parity bits, MAC-layer back-off, IEEE 802.11 inter-frame spacing and ACKs, and headers
Simulations with varying received signal strength
The simulator's estimate deviates from reality when signal strength is poor
Measured throughput matches the simulator's estimate closely when signal quality is good
Why do simulation results deviate when signal strength is poor?
Lack of an accurate packet-loss model as a function of packet size, RSS, and ambient noise
Depends on the signal-processing hardware and the RF antenna within the wireless cards
Lack of accurate auto-rate control modeling
Adjustment of the sending rate done by WLAN cards based on the transmission conditions
How to model auto-rate control done by WLAN cards?
Use trace-driven simulation
When auto-rate is in use: collect the rate at which the wireless card is operating and provide the reported rate to the simulator
Otherwise: the data rate is known to the simulator
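A minimal sketch of this rule, assuming a trace format with per-link reported rates (the field names are made up):

```python
def data_rate_for(link, trace_sample, autorate_enabled, fixed_rate_mbps=54.0):
    """Pick the data rate the simulator should use for a link."""
    if autorate_enabled:
        # When auto-rate is in use, use the rate the wireless card
        # actually reported for this link in the collected trace.
        return trace_sample["reported_rate_mbps"][link]
    # Otherwise the configured data rate is already known to the simulator.
    return fixed_rate_mbps
```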
How to model packet loss accurately as a function of packet size, RSS, and ambient noise?
Use offline analysis
Calibrate the wireless cards and create a database associating environmental factors with expected performance
E.g., a mapping from signal strength and noise to loss rate
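One way such a calibration database could look, with made-up bin boundaries and loss values standing in for real per-card measurements:

```python
import bisect

class LossCalibration:
    """Lookup table from measured signal strength and noise to loss rate."""

    def __init__(self):
        # RSS bin lower bounds (dBm) -> loss rate at a reference noise floor;
        # these numbers are illustrative, not measured data.
        self.rss_bins = [-90, -80, -70, -60]
        self.loss_rates = [0.60, 0.30, 0.10, 0.01]

    def loss_rate(self, rss_dbm, noise_dbm):
        # Penalize the effective signal as ambient noise rises above an
        # assumed -95 dBm floor; this SNR-style adjustment is a simplification.
        effective = rss_dbm - max(0, noise_dbm + 95)
        i = bisect.bisect_right(self.rss_bins, effective) - 1
        return self.loss_rates[max(i, 0)]
```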
Experiment to model the loss rates due to poor signal strength
Collect another set of traces
Slowly send out packets
Place packet sniffers near both the sender and the receiver, and derive the loss rate from the packet-level trace
Seed the wireless link in the simulator with a Bernoulli loss rate that matches the loss rate in the real traces
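The two steps — deriving the loss rate from the sniffer traces and seeding the simulated link with a matching Bernoulli process — might look like this sketch, where packet logs are modeled as lists of sequence numbers:

```python
import random

def measured_loss_rate(sender_log, receiver_log):
    """Derive the loss rate from packet-level traces at sender and receiver."""
    sent = set(sender_log)
    received = set(receiver_log)
    return 1.0 - len(sent & received) / len(sent)

def bernoulli_link(loss_rate, rng=None):
    """Return a per-packet drop decision function for the simulated link."""
    rng = rng or random.Random()
    return lambda: rng.random() < loss_rate
```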
Estimated and measured throughput when compensating for the loss rate due to poor signal strength
Loss rate and measured throughput do not decrease monotonically with signal strength, due to the effect of auto-rate
Even though the match is not perfect, this is not expected to be a problem, because:
many routing protocols try to avoid the use of poor-quality links
poor-quality links are used only when certain parts of the mesh network have poor connectivity to the rest of the network
in a well-engineered network, not many nodes depend on such bad links for routing
Stability of channel conditions
How rapidly do channel conditions change, and how often should a trace be collected?
Temporal fluctuation in RSS:
Fluctuation magnitude is not significant
The relative quality of signals across different numbers of walls remains stable
Stability of channel conditions
When the environment is generally static, nodes may report only the average and standard deviation of the RSS to the manager every few minutes
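Such a report could be as compact as this sketch, which reduces a window of raw RSS samples to two numbers before sending them to the manager:

```python
import statistics

def rss_summary(rss_samples_dbm):
    """Compress raw RSS samples into the (mean, stddev) report."""
    return {
        "mean_dbm": statistics.mean(rss_samples_dbm),
        "stddev_db": statistics.pstdev(rss_samples_dbm),
    }
```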
Dealing with imperfect data
By neighborhood monitoring:
Each node reports performance and traffic statistics for its incoming and outgoing links, and for other links in its communication range (possible when the node is in promiscuous mode)
Thus multiple reports are sent for each link
Redundant reports can be used to detect inconsistency
Find the minimum set of nodes that can explain the inconsistency in the reports
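Finding a small node set that explains all conflicting reports can be approximated greedily; the set-cover framing below is an assumed heuristic, not necessarily the paper's exact formulation:

```python
def suspect_nodes(inconsistent_reports):
    """inconsistent_reports: list of sets of nodes involved in each
    conflicting report. Returns a small node set covering all conflicts."""
    uncovered = [set(r) for r in inconsistent_reports]
    suspects = set()
    while uncovered:
        # Pick the node appearing in the most still-unexplained conflicts.
        counts = {}
        for report in uncovered:
            for node in report:
                counts[node] = counts.get(node, 0) + 1
        best = max(counts, key=counts.get)
        suspects.add(best)
        # Conflicts involving the chosen node are now explained.
        uncovered = [r for r in uncovered if best not in r]
    return suspects
```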
Summary
How to accurately model the real behavior?
Solution: use trace-based simulation
Problem: simulation results are good for strong signals but deviate under bad RF conditions
Need to model the auto-rate control
Need to model the loss rate due to poor signal strength: use offline analysis
How often should a trace be collected?
Use trace-driven data: very little data (average and standard deviation of RSS) at fairly low time granularity, as channels are relatively stable
How to deal with imperfect data?
By neighborhood monitoring
Fault diagnosis
[Roadmap diagram: faulty network, faults directory → network model, types of faults → detected faults]
Current fault diagnosis approaches
AI techniques: rule-based systems, neural networks
Model-traversing techniques: dependency graphs, causality graphs, Bayesian networks
Fault Isolation and Diagnosis
Establish the expected performance in the simulation
Find the difference between expected and observed performance
Search over the fault space to detect which set of faults can reproduce performance similar to what has been observed
Collecting data from traces
Trace data collection:
Network topology: each node reports its neighbor and routing tables
Traffic statistics: each node maintains counters of traffic sent to and received from immediate neighbors
Physical medium: each node reports the signal strength of the wireless links to its neighbors
Network performance: includes both link and end-to-end performance, which can be measured through loss rate, delay, and throughput; the focus is on link-level performance
Simulating the network performance
Traffic load simulation: link-based traffic simulation; adjust the application sending rate to match the observed link-level traffic counts
Route simulation: use the actual routes taken by packets as input to the simulator
Wireless signal: use real measurements of signal strength
Fault injection
Random packet dropping
External noise sources
MAC misbehavior
Fault diagnosis algorithm
General approach:
[Diagram: network settings → simulator → expected performance; network settings + fault set → simulator → observed performance. How to find the fault set?]
How to search the fault space efficiently?
Different types of faults often change only one or a few metrics
E.g., random dropping affects only the link loss rate
Thus, use the metrics in which observed and expected performance differ significantly to guide the search
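A sketch of using the deviating metrics to prune the candidate fault types before searching, with the fault-to-metric mapping taken from the symptoms listed elsewhere in the talk (the dictionary and function names are illustrative):

```python
# Each fault type is associated with the metric(s) it visibly changes,
# per the slides: random dropping -> loss rate, external noise -> noise
# level, MAC misbehavior -> throughput.
FAULT_METRICS = {
    "random_dropping": {"loss_rate"},
    "external_noise": {"noise"},
    "mac_misbehavior": {"throughput"},
}

def candidate_faults(expected, observed, threshold=0.1):
    """Return fault types consistent with the significantly deviating metrics."""
    deviating = {m for m in observed
                 if abs(observed[m] - expected[m]) > threshold}
    return {fault for fault, metrics in FAULT_METRICS.items()
            if metrics & deviating}
```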
Scenario where faults do not have strong interactions
Consider a large deviation from the expected performance as an anomaly
Use a decision tree to determine the type of fault
The fault type determines the metric used to quantify the performance difference
Locate faults by finding the set of nodes and links with a large difference between expected and observed performance
Scenario where faults have strong interactions
Get the initial diagnosis set from the decision-tree algorithm
Iteratively refine the fault set:
Adjust the magnitudes of faults in the fault set; translate the difference in performance into a change in the faults' magnitudes (mapping the impact of a fault back to its magnitude)
Remove faults whose magnitude is too small
Add new faults that can explain large differences between the expected and observed performance
Iterate until the change in the fault set is negligible
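The refinement loop might be sketched as follows. Two caveats: the step that adds brand-new faults is omitted, and the simple proportional update standing in for the paper's impact-to-magnitude translation is an assumption:

```python
def refine(fault_set, simulate, observed, min_magnitude=0.01,
           step=0.5, max_iters=50):
    """fault_set: dict fault_id -> magnitude. Returns the refined set."""
    for _ in range(max_iters):
        expected = simulate(fault_set)
        changed = False
        for fault in list(fault_set):
            # Translate the per-fault performance gap into a magnitude change.
            gap = observed[fault] - expected[fault]
            if abs(gap) > min_magnitude:
                fault_set[fault] += step * gap
                changed = True
            # Remove faults whose magnitude has become negligible.
            if fault_set[fault] < min_magnitude:
                del fault_set[fault]
                changed = True
        if not changed:  # change in the fault set is negligible: stop
            break
    return fault_set
```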
Example scenario
[Figure: five-node wireless topology, nodes 1-5]
Example scenario
Observed performance:
• Increased loss rate at links 1-4 and 1-2
• No increase in the sending rate of 1-4 and 1-2
• No increase in noise experienced by neighbors
Inference (decision tree):
Increased sending rate? Y → too low CW; N ↓
Increased noise? Y → noise; N ↓
Increased loss? Y → packet drop; N → normal
Example scenario (continued)
Walking the decision tree with these observations: no increased sending rate, no increased noise, but increased loss → packet drop
Inference: packet dropping at node 1
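The decision tree in the example reduces to three ordered symptom checks, which can be written directly as code:

```python
def classify(symptoms):
    """symptoms: dict with booleans for the three observed anomalies."""
    if symptoms["increased_sending_rate"]:
        return "too low CW"    # contention window set too small
    if symptoms["increased_noise"]:
        return "noise"         # external noise source
    if symptoms["increased_loss"]:
        return "packet drop"   # loss despite normal rate and noise
    return "normal"

# The example's observations (higher loss on 1-4 and 1-2, no increase in
# sending rate or neighbor noise) fall through to the "packet drop" branch,
# pointing at packet dropping at node 1.
```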
Accuracy of fault diagnosis
Correctness of the model:
Complete information
Consistent information
Timely information
Correctness of the reported symptoms:
Right size of the threshold to report a symptom
Difference in the behavior of faults
Timely reporting of symptoms
System implementation
Windows XP
Agents run on every wireless node and report collected information on demand
Managers collect and analyze the information
Collected information is cast into performance counters supported by Windows
The manager is connected to a backend simulator; collected information is converted into a script that drives the simulation
Testbed:
Multihop wireless testbed built using IEEE 802.11a cards
A commercially available network sniffer called AiroPeek is used for data collection
Native 802.11 NICs provide a rich set of networking information
Evaluation: Data collection overhead
Overhead < 800 bits/s/node
Management traffic overhead: data collection traffic has little effect on the performance of an FTP flow (measured with and without data collection)
No data cleaning: each link is reported only once
With data cleaning: each link is reported by all observers for consistency checking
Data cleaning effectiveness
Coverage greater than 80% in all cases
Higher accuracy with grid topology
Higher coverage when using history
Higher accuracy with denser networks
Higher accuracy with client-server traffic
Evaluation: Fault diagnosis
Detecting random dropping
• Symptom: significant difference in loss rates on links
• Less than 20% of faulty links are left undetected
• No-effect faults are faulty links sending less than the threshold (250 packets) of data
Detecting external noise
• Symptom: significant difference in noise level at nodes
• Noise sources are correctly identified with at most one or two false positives
• Inference error in the magnitudes of noise is within 4%
Evaluation: Fault diagnosis (continued)
Detecting MAC misbehavior
• Symptom: significant discrepancy in throughput on links
• Coverage is mostly around 80% or higher
• False positives are within 2
Detecting combinations of all of the above fault types
What-if analysis
[Roadmap diagram: network model, types of faults, faults directory, detected faults → corrective measures]
What-if analysis
Topology → diagnosis → corrective measures
Limitations
Limited by the accuracy of the simulator
The time to detect faults is acceptable for long-term faults but not for transient faults
The choice of traces used to drive the simulation has important implications
The focus has been only on faults that result in different observable behavior
Conclusion
Trace data is used for modeling the network
Data collection techniques are presented to collect network information and detect deviations from the expected performance
A fault diagnosis algorithm is proposed to detect the root causes of failures
A scheme for what-if analysis is proposed to evaluate alternative network configurations for efficient network operation
Future work
Validation on a large test-bed
Performance analysis in presence of mobility
Detecting malicious attacks
Diagnosis in presence of incomplete network
information
More deeply investigating the potential of what-if
analysis
References
L. Qiu, P. Bahl, A. Rao, and L. Zhou, "Fault Detection, Isolation, and Diagnosis in Multihop Wireless Networks," Microsoft Research TR-2004-11, Dec. 2003.
M. Steinder and A. Sethi, "A Survey of Fault Localization Techniques in Computer Networks," Technical Report, CIS Dept., University of Delaware, Feb. 2001.
M. Steinder, "Probabilistic Inference for Diagnosing Service Failures in Communication Systems," PhD thesis, University of Delaware, 2003.
Questions
What is the proposed solution to model the throughput when the signal strength is poor? In Table 2, the simulated throughput monotonically decreases with the loss rate while the measured throughput does not. Why?
What could cause false positives in the fault diagnosis results? When can the false-positive ratio increase?
http://www.cis.udel.edu/~natu/861/861.html