Agnostic Diagnosis: Discovering Silent Failures in Wireless Sensor Networks Xin Miao1, Kebin Liu1,2, Yuan He1,2, Yunhao Liu1, Dimitris Papadias1 1 Department of CSE, HKUST 2


Agnostic Diagnosis: Discovering Silent Failures
in Wireless Sensor Networks
Xin Miao1, Kebin Liu1,2, Yuan He1,2, Yunhao Liu1, Dimitris Papadias1
1 Department
of CSE, HKUST
2 TNLIST, SS, Tsinghua University
Reporter: Lin Guokun (10212745)
SIST of SYSU
2011-06-08
1/24
Agenda
Introduction
Motivation
Agnostic Diagnosis
Related Works
Conclusions
2/24
Introduction
• Wireless sensor networks (WSNs) have been used
in many fields
• Diagnosing WSNs is a challenging issue
• This work is motivated by the need for long-term reliable
operation of GreenOrbs
3/24
Contributions
• Agnostic Diagnosis (AD) relies on minimal a priori
knowledge
• Proposes the correlation graph (CG), which efficiently
characterizes the internal correlations inside a node
• Implements AD and evaluates it with traces from
a 330-node GreenOrbs deployment
4/24
Observation
Change dramatically
Possibly Faulty!
This example demonstrates that examining the metrics of a single sensor
individually may overlook silent failures or may flag failures by mistake!
6/24
Observation
Seems Perform Well
Weak Correlation
This example suggests that even when multiple sensor nodes are considered, an individual
metric is insufficient to uncover failures!
7/24
Motivation
• GreenOrbs
- Design a diagnostic system for GreenOrbs, which is
deployed in a remote area where in-situ debugging
and troubleshooting are expensive
• Observations
- Existing methods based on individual
metrics do not work well
8/24
Agnostic Diagnosis (AD)
Symbol        Definition
$N$           Number of sensors
$p$           Number of metrics
$w$           Window size
$s_i$         Sensor node $i$, $1 \le i \le N$
$S_{i,t}$     The $p$-dimensional status vector of sensor $i$ at time $t$,
              $S_{i,t} = (m_{1,t}, m_{2,t}, \ldots, m_{p,t})$
$m_{u,t}$     The value of the $u$-th metric at time $t$, where $1 \le u \le p$
$c_k(u,v)$    Correlation score of metrics $u$ and $v$ in time window $k$,
              where $1 \le u, v \le p$
$CG_{i,k}$    Correlation graph of sensor $i$ in window $k$
10/24
AD - Correlation Graph
For the correlation between metrics u and v in each time window k,
we define the correlation score using Pearson's coefficient:
\[
c_k(u,v) = \frac{w \sum_{i=1}^{w} m_{u,(k-1)w+i}\, m_{v,(k-1)w+i} - \sum_{i=1}^{w} m_{u,(k-1)w+i} \sum_{i=1}^{w} m_{v,(k-1)w+i}}{\sigma_{u,k}\, \sigma_{v,k}}
\]
where $\sigma_{u,k}$ and $\sigma_{v,k}$ are their standard deviations:
\[
\sigma_{u,k} = \sqrt{w \sum_{i=1}^{w} m_{u,(k-1)w+i}^{2} - \Big(\sum_{i=1}^{w} m_{u,(k-1)w+i}\Big)^{2}}
\]
\[
\sigma_{v,k} = \sqrt{w \sum_{i=1}^{w} m_{v,(k-1)w+i}^{2} - \Big(\sum_{i=1}^{w} m_{v,(k-1)w+i}\Big)^{2}}
\]
And the correlation graph of sensor i in window k is as follows:
\[
CG_{i,k} =
\begin{pmatrix}
c_k(1,1) & c_k(1,2) & \cdots & c_k(1,p) \\
c_k(2,1) & c_k(2,2) & \cdots & c_k(2,p) \\
\vdots   & \vdots   & \ddots & \vdots   \\
c_k(p,1) & c_k(p,2) & \cdots & c_k(p,p)
\end{pmatrix}
\]
$c_k(u,v) \in [-1,1]$
maps to
$c_k(u,v) \in [0,255]$
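The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a metric that stays constant within a window is given a score of 0 (the slides do not say how that case is handled), and the mapping to [0, 255] is assumed to be linear.

```python
import math


def correlation_score(xs, ys):
    """Pearson's coefficient of two metric series taken from one window."""
    w = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    numerator = w * sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y
    # max(..., 0.0) guards against tiny negative values caused by rounding.
    sigma_x = math.sqrt(max(w * sum(x * x for x in xs) - sum_x ** 2, 0.0))
    sigma_y = math.sqrt(max(w * sum(y * y for y in ys) - sum_y ** 2, 0.0))
    if sigma_x == 0.0 or sigma_y == 0.0:
        return 0.0  # assumption: a constant metric is treated as uncorrelated
    return numerator / (sigma_x * sigma_y)


def correlation_graph(window):
    """Build the p x p matrix of scores from w status vectors of p metrics."""
    p = len(window[0])
    columns = [[status[u] for status in window] for u in range(p)]
    return [[correlation_score(columns[u], columns[v]) for v in range(p)]
            for u in range(p)]


def to_gray(score):
    """Assumed linear map from [-1, 1] to an integer gray level in [0, 255]."""
    clamped = min(1.0, max(-1.0, score))
    return round((clamped + 1.0) / 2.0 * 255)


# Three status vectors (w = 3) with three metrics each (p = 3).
window = [(1, 2, 5), (2, 4, 3), (3, 6, 1)]
cg = correlation_graph(window)
```

In this toy window the second metric is twice the first and the third decreases as the first increases, so the corresponding scores are close to 1 and -1 respectively.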
11/24
AD Framework
• Temporal Detection
- Given two CGs of the same sensor i in
successive time windows, an abrupt change between
the CGs usually indicates a sensor node failure
• Spatial Detection
- Given a set of nodes that share similar patterns,
nodes that diverge from the common patterns are
considered suspicious
12/24
AD - Temporal Detection
• Cumulative Sum (CUSUM)
- time series: $\{c_1(u,v), c_2(u,v), \ldots, c_n(u,v)\}$
- $CS_0 = 0$, $CS_i = CS_{i-1} + c_i(u,v) - \frac{1}{n}\sum_{i=1}^{n} c_i(u,v)$
- $CS_{diff} = \max(CS_i) - \min(CS_i)$
• Bootstrap Analysis
- reorder the time series and calculate $CS'_{diff}$, M times,
among which X times $CS'_{diff} < CS_{diff}$
- the confidence level of $CS_{diff}$ is $X/M$
• Change Point Detection
- $X/M$ above a threshold (e.g. 90%) indicates an abrupt change
- $c_{k+1}(u,v)$ is the change point, where $CS_k = \max CS_i$
13/24
Temporal Detection - CUSUM Example
[Figure: original time series]
14/24
Temporal Detection - CUSUM Example
[Figure: original cumulative sum and cumulative sums after bootstrap]
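The CUSUM and bootstrap steps of temporal detection can be sketched as follows. This is a minimal illustration, not the authors' code: the names `cusum` and `detect_change`, the defaults `m=1000` and `threshold=0.9`, and the fixed random seed are assumptions. One deliberate deviation is labeled in the code: the change point is located at the largest |CS_i| rather than the largest CS_i, so that both upward and downward shifts are found.

```python
import random


def cusum(series):
    """Return (CS_diff, [CS_0, CS_1, ..., CS_n]) for one correlation series."""
    mean = sum(series) / len(series)
    sums = [0.0]  # CS_0 = 0
    for value in series:
        sums.append(sums[-1] + value - mean)
    return max(sums) - min(sums), sums


def detect_change(series, m=1000, threshold=0.9, seed=0):
    """Bootstrap the confidence level X/M and locate the change point.

    Returns (confidence, index), where index is the 1-based position of the
    first sample after the change, or None if confidence <= threshold.
    """
    rng = random.Random(seed)  # fixed seed only to keep the sketch repeatable
    diff, sums = cusum(series)
    shuffled = list(series)
    x = 0
    for _ in range(m):
        rng.shuffle(shuffled)  # reorder the time series
        if cusum(shuffled)[0] < diff:
            x += 1
    confidence = x / m
    if confidence <= threshold:
        return confidence, None
    # Assumption: use |CS_i| so that shifts in either direction are located.
    k = max(range(len(sums)), key=lambda i: abs(sums[i]))
    return confidence, k + 1


# Correlation score that drops from 0.9 to 0.1 after the tenth window.
dropping = [0.9] * 10 + [0.1] * 10
confidence, change_index = detect_change(dropping)

# A series that only alternates between two close values.
steady = [0.5, 0.6] * 10
steady_confidence, steady_index = detect_change(steady)
```

Shuffling destroys any ordering in the series, so a real shift makes the original CS_diff larger than almost every reordered one, which is what the ratio X/M measures.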
15/24
AD - Spatial Detection
• K-Means Clustering
- node set $\{s_1, s_2, \ldots, s_n\}$ with their CGs in window t
- K clusters with centroids $C_1, C_2, \ldots, C_K$
- the suspicious confidence level of $s_i$ is $\min_j \big(dist(CG_{i,t}, C_j)\big)$
• Principal Component Analysis
- projection matrix $P = (u_1, u_2, \ldots, u_m)^T$, where $u_i$
is a d-dimensional column vector
- in the authors' evaluation, K=3 and m=15 work well
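The clustering step can be sketched as below. It is an illustration under several assumptions rather than the evaluated system: each CG is flattened into a plain vector, Euclidean distance stands in for dist, the k-means is a basic Lloyd iteration seeded with the first k vectors, and the PCA projection is skipped. All function names are hypothetical.

```python
import math


def flatten(cg):
    """Turn a p x p correlation graph into a vector of length p * p."""
    return [score for row in cg for score in row]


def k_means(points, k, iterations=100):
    """Basic Lloyd iteration; assumption: seeded with the first k points."""
    centroids = [list(point) for point in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda j: math.dist(point, centroids[j]))
            clusters[nearest].append(point)
        updated = [
            [sum(column) / len(members) for column in zip(*members)]
            if members else centroids[j]  # keep the centroid of an empty cluster
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def suspicion_scores(points, k):
    """Distance from each node's vector to its nearest cluster centroid."""
    centroids = k_means(points, k)
    return [min(math.dist(point, c) for c in centroids) for point in points]


# Toy 2-dimensional stand-ins for flattened CGs: two tight groups and one
# node (index 8) that follows neither pattern.
vectors = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11),
           (5, 20)]
scores = suspicion_scores(vectors, k=2)
most_suspicious = scores.index(max(scores))
```

A node far from every centroid gets a high score. Note that an outlier can also pull a centroid toward itself or capture a cluster of its own, so the choice of K and the initialization matter in practice.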
16/24
Spatial Detection - Routing Failure
Correlation Score of:
1. ParentChangeCounter
2. RadioOnCounter
The lighter the cell, the more strongly the two metrics are correlated!
17/24
Effectiveness & Traffic Overhead
• Effectiveness
- randomly pick 30 of the nodes identified as faulty
- 23 of them are failures, 5 are false alarms, 2 are others
• Traffic Overhead
- Every 15 minutes, all the status information can
be packed into one packet and sent back with
the sensing data
18/24
Related Works - Software Debugging
• Clairvoyant
- a GDB-like source-level debugging tool
• Declarative Tracepoint
- allows inserting action-associated rules into applications
at runtime
• DustMiner
- a front end collects runtime event logs and a back end
performs frequent pattern mining to uncover failure
causes and performance anomalies
20/24
Related Works - Rule-based Mining or Model
• Sympathy
- employs decision trees to analyse failure causes
• PAD
- leverages a packet marking strategy for constructing
and maintaining the inference model
• PowerTracing
- employs a special power meter and an HMM to identify
patterns of power consumption
21/24
Conclusions
• Presents a novel approach that discovers silent
failures
• Relies on minimal domain knowledge
• Evaluates on a dataset collected from 330 nodes
over several months
23/24
Thanks for your attention!
speaker - Lin Guokun(10212745)
2011-06-08
24/24