Transaction Process Monitoring

ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2009

Learning, Indexing and Diagnosing Network Faults

Ting Wang, Mudhakar Srivatsa, Dakshi Agrawal and Ling Liu

Georgia Institute of Technology

IBM T.J. Watson Research Center

© 2008 IBM Corporation

Complex Networks

• Network as a graph
  – Vertices represent network entities
  – Edges represent pair-wise (local) interactions between network entities
• Even simple interactions give rise to complex global network phenomena
  – Fault cascading in communication networks
  – Information spread (e.g., via emails) in social networks
  – Infection propagation in protein interaction networks

Key challenge is to detect and understand emerging global phenomena

Network Monitoring Data

• Networks generate massive monitoring data (aka events)
  – Monitored data consists of local (in both space and time) observations on the network
  – Monitored data is incomplete and sometimes even erroneous (e.g., imprecise, out-of-order with respect to both time and causality, etc.)
• Examples
  – Ping failure, interface down, high CPU utilization, etc. in communication networks
  – Email threads (time stamp, tokenized subject, MIME type, etc.) between members in an organizational hierarchy
  – Pathological symptoms in biological networks – protein interaction networks (PINs)
• Key observation: monitoring data gathered from network entities are correlated through the network topology

Network Patterns

• Network patterns attempt to efficiently capture spatial (topological) and temporal correlations in monitored data
• Key challenges
  – Understand the semantics of network patterns
  – Identify domain-specific network patterns (e.g., fault diagnosis & prediction in IT systems, information spread and access control on social networks, disease propagation in protein networks, etc.)
  – How to learn and represent network patterns?
  – How to scalably match network patterns against an online stream of network events?

Simplified Examples (pattern over events e1, e2, e3)

• iBGP server; OSPF networks N1 and N2
  – e1: Update configuration → withdraw prefix announcement
  – e2: N1 says N2 is not reachable
  – e3: N2 says N1 is not reachable
• Director D; Employees N1 and N2
  – e1: Meeting with D and N1
  – e2: Email from N1 to N2
  – e3: N2 updates project design document
• Person P; Friends N1 and N2
  – e1: P updates a blog on her facebook page
  – e2: N1 sends friend request to N2
  – e3: N2 views P's updates and accepts N1's friend request

Network Patterns

• Notation and Formalism
  – Event data
  – Network Pattern
• Temporal Pattern
  – E.g.: Markov chains, frequent item sets
• Spatial Pattern: Composition/Closures of one or more topological relationships
  – Communication networks: upstream, downstream, neighbor, tunnel
  – Social networks: manages, friends, team members, IM buddies
  – Biological network: catalyst, inhibitor, suppressor

[Figure: Temporal Pattern: Markov Chain (states e1, e2, e3; transitions t11, t12, t13, t22, t23, t33; label INTERFACE DOWN)]
[Figure: Temporal Pattern: Frequent Item Sets]
[Figure: Spatial Pattern: Downstream (transitive closure)]
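The spatial pattern "Downstream (transitive closure)" can be illustrated with a small sketch. It assumes the topology is given as directed (source, destination) links; the function name `downstream_closure` and the router names are hypothetical and not taken from the slides.

```python
from collections import deque

def downstream_closure(edges, start):
    """Return every node reachable from `start` by following downstream links."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical topology: r1 feeds r2 and r3; r3 feeds r4.
links = [("r1", "r2"), ("r1", "r3"), ("r3", "r4")]
closure = downstream_closure(links, "r1")
```

A breadth-first traversal visits each link at most once, so the closure of a single node costs time proportional to the size of the reachable subgraph.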

Fault Diagnosis and Prediction in Communication Networks

• Challenges: improve scalability & expressiveness of fault diagnosis
  – Limitation of current solutions: complexity that grows as the square of the network size
  – Correlation rules are pair-wise: expensive to support complex fault diagnosis (e.g., predicting soft failures, router failure from VRF tunnel events, etc.)
  – Lacks predictive capability
• Approach:
  – Fault signatures encode temporal patterns (frequent item sets, Markov chains) and topological patterns that span the network (upstream, downstream, neighbors, VPN tunnels, etc.)
  – Topologically index streaming monitoring data to facilitate scalable single-pass event correlation and fault diagnosis
  – Results in linear complexity – increased scalability

[Figure: Traditional RCA Engine vs. Proposed Approach. Labels: Topology, Correlation Engine (ITNM RCA), Topological Index, Fault diagnosis, Monitoring Data (Omnibus), Pair-wise correlation rules, Fault Signatures (Network Patterns)]

Complexity:
  – Traditional RCA Engine: Monitoring data x Monitoring data x Rules
  – Proposed Approach: Monitoring data x Network Diameter x Signatures
  – Monitoring data ~ linear in network size
  – Network diameter ~ logarithmic in network size for power-law networks
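To make the two complexity expressions concrete, the following sketch plugs in numbers. Everything numeric here is an assumption for illustration: one event per node, ten rules, ten signatures, and `log2` of the network size as a stand-in for the diameter of a power-law network.

```python
import math

def traditional_cost(num_events, num_rules):
    # Pair-wise correlation: Monitoring data x Monitoring data x Rules.
    return num_events * num_events * num_rules

def indexed_cost(num_events, diameter, num_signatures):
    # Topologically indexed: Monitoring data x Network Diameter x Signatures.
    return num_events * diameter * num_signatures

def estimate(network_size, events_per_node=1.0, rules=10, signatures=10):
    """Return (traditional, indexed) operation counts for a network size."""
    events = int(network_size * events_per_node)        # assumed linear in size
    diameter = max(1, math.ceil(math.log2(network_size)))  # assumed ~ log(size)
    return (traditional_cost(events, rules),
            indexed_cost(events, diameter, signatures))

traditional, indexed = estimate(1024)
```

Under these assumptions a 1,024-node network needs 10,485,760 pair-wise operations against 102,400 indexed ones, and the gap widens as the network grows.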

Step 1: Learning Network Faults

• Learn fault signatures from historical network event data
  – Fault Synopsis: Fault Type → Network Pattern
  – Fault Signature: Network Pattern → …
  – Fault Diagnosis: … → Faulty Node
  – Fault Prediction: Use incrementally matchable network patterns
• Use indexable network patterns
  – Topological relationships are invertible: neighbor^-1 = neighbor, downstream^-1 = upstream

Fault Synopsis

Fault Type  | f1 | f2
up-stream   | c1 | c2
down-stream | c2 | c4
neighbor    | c3 | c1
…           | …  | …

Fault Signature

Network Pattern | c1     | c2     | c3     | c4
up-stream       | f1, p1 | f2, p2 |        |
down-stream     |        | f1, p1 |        | f2, p2
neighbor        | f2, p2 |        | f1, p1 |
…               | …      | …      | …      | …
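The relationship between the Fault Synopsis and Fault Signature tables can be sketched as an index inversion. This is a minimal illustration, not the authors' implementation: it treats c1 to c4 as event classes (an assumption), omits the probabilities p1 and p2, and uses the hypothetical helper name `build_signature_index`.

```python
from collections import defaultdict

# Fault synopsis as read from the slide: fault type -> {relationship: class}.
synopsis = {
    "f1": {"up-stream": "c1", "down-stream": "c2", "neighbor": "c3"},
    "f2": {"up-stream": "c2", "down-stream": "c4", "neighbor": "c1"},
}

def build_signature_index(synopsis):
    """Invert the synopsis so (relationship, class) looks up candidate faults."""
    index = defaultdict(list)
    for fault, pattern in synopsis.items():
        for relation, event_class in pattern.items():
            index[(relation, event_class)].append(fault)
    return dict(index)

signature = build_signature_index(synopsis)
```

With the inverted form, an observed class under a given relationship resolves to its candidate fault types with a single dictionary lookup, for example `signature[("neighbor", "c1")]`.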

Step 2: Online Matching

• Fault localization using topological indices and hierarchical evidence aggregation
  – Topology indexing algorithms + space-time trade-off in computing R(x) and R^-1(x)
    • R ∈ {upstream, downstream, neighbor, tunnel, …}
  – Scalable hierarchical evidence aggregation for efficient fault diagnosis

[Figure: Network Pattern. Labels: c1, c2, c3; up-stream, down-stream, neighbor, VPN Tunnel; Device Down; f1, f2]
[Figure: Scalable Hierarchical Evidence Aggregation. Labels: f1 … fn, each paired with bf; c1, c2, c3; n1, n2; Evidence Aggregation]
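The idea of precomputing R(x) and R^-1(x) and then aggregating evidence can be sketched as follows. This is a simplified, flat vote count rather than the hierarchical aggregation named on the slide, and the relation, node names, and function names (`build_index`, `localize`) are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(edges):
    """Precompute R(x) and R^-1(x) for one relationship R given as (x, y) pairs."""
    forward, inverse = defaultdict(set), defaultdict(set)
    for x, y in edges:
        forward[x].add(y)
        inverse[y].add(x)
    return forward, inverse

def localize(event_nodes, inverse):
    """Each event observed at a node votes for every node in R^-1(node)."""
    votes = Counter()
    for node in event_nodes:
        for suspect in inverse.get(node, ()):
            votes[suspect] += 1
    return votes.most_common()

# Hypothetical downstream relation: r1 -> {a, b}, r2 -> {b, c}.
downstream = [("r1", "a"), ("r1", "b"), ("r2", "b"), ("r2", "c")]
forward, upstream = build_index(downstream)
ranking = localize(["a", "b"], upstream)
```

Storing both directions doubles the memory used for the relation but turns each R^-1 lookup into a constant-time dictionary access, which is one simple form of the space-time trade-off mentioned above.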

Details

OFFLINE LEARNING

Inputs: Event Datasets, Network Topology

• Preparation of training data
  – Interval Filter: segment event dataset into event bursts
  – Support Filter: eliminate high frequency (regular n/w ops) and low frequency burst sets (noise)
  – Periodicity Filter: eliminate burst sets with high periodicity (maintenance ops)
• Extract temporal patterns
  – Markov chains and maximum likelihood estimation
• Extract topological patterns
  – Set of topological relationships: SE, NE, DS, US, TN
  – Principle of minimum explanation

Output: Fault Signatures
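Two of the offline steps, the Interval Filter and the maximum likelihood estimation of a Markov chain, can be sketched together. The slides give no parameters, so the gap threshold `max_gap`, the function names, and the sample events below are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def interval_filter(events, max_gap):
    """Split (timestamp, event_type) pairs into bursts separated by gaps > max_gap."""
    bursts, current, last_time = [], [], None
    for timestamp, event_type in sorted(events):
        if last_time is not None and timestamp - last_time > max_gap:
            bursts.append(current)
            current = []
        current.append(event_type)
        last_time = timestamp
    if current:
        bursts.append(current)
    return bursts

def estimate_transitions(bursts):
    """Maximum likelihood estimate of first-order transition probabilities."""
    counts = defaultdict(Counter)
    for burst in bursts:
        for prev, nxt in zip(burst, burst[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for prev, following in counts.items()
    }

# Hypothetical event log: three bursts separated by long quiet periods.
events = [(0, "e1"), (1, "e2"), (2, "e3"),
          (50, "e1"), (51, "e2"), (52, "e2"),
          (120, "e1"), (121, "e3")]
bursts = interval_filter(events, max_gap=10)
transitions = estimate_transitions(bursts)
```

The estimate is the usual count ratio: the probability of moving from one event type to another is the number of times that transition was seen divided by the number of transitions leaving the first type.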

ONLINE MATCHING

Inputs: Event Stream, Fault Signatures, Network Topology

• Match temporal patterns
  – Min-Heap + incremental pattern matching
• Fault Signatures
  – Inverted Index for constant time lookup
• Indexed network topology
  – Space-Time tradeoffs
• Evidences: Scalable Evidence Aggregation
  – BIRCH data structure (hierarchical aggregation)
  – Optimizations: filter-and-refine (Bloom filter) + slotted aggregation (BIGTABLE)

Output: Fault Diagnosis and Prediction
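The slides name a min-heap for online matching without spelling out its role. One plausible use, assumed here, is a reorder buffer that releases out-of-order events in timestamp order before they are matched incrementally. The `max_delay` bound and the function name `reorder` are hypothetical.

```python
import heapq

def reorder(stream, max_delay):
    """Yield (timestamp, event) pairs in timestamp order.

    Assumes no event arrives more than `max_delay` time units late
    relative to the newest timestamp seen so far.
    """
    heap = []
    newest = float("-inf")
    for timestamp, event in stream:
        heapq.heappush(heap, (timestamp, event))
        newest = max(newest, timestamp)
        # Anything old enough can no longer be overtaken by a late arrival.
        while heap and heap[0][0] <= newest - max_delay:
            yield heapq.heappop(heap)
    # Flush whatever is left once the stream ends.
    while heap:
        yield heapq.heappop(heap)

# Hypothetical arrivals, slightly out of order.
arrivals = [(1, "ping failure"), (3, "interface down"), (2, "high CPU"),
            (6, "ping failure"), (5, "interface down")]
ordered = list(reorder(arrivals, max_delay=2))
```

The buffer holds only events within the lateness window, so memory stays bounded by the arrival rate times `max_delay` rather than by the length of the stream.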

Fault Diagnosis & Prediction: Scalability

• Result Summary:
  – SNMP Trap messages from a large enterprise (7 ASes, 32 IGP networks, 871 subnets, 1,268 VPN tunnels, 2,068 main nodes, 18,747 interfaces and 192,000 entities) over 14 days in 2007
  – Topology dataset – European backbone network (2,383 main nodes, spans 7 countries, 11 ASes and over 100,000 entities)
  – Network fault simulator and monitoring data generation
  – Linear scalability; further optimizations: prune-and-search; slotted hierarchical aggregation
• Ongoing activities
  – Integration with IBM Tivoli Network Management suite (ITNM) for live testing and fine-tuning
  – Network patterns for access control on information flows over: (i) ENRON email data & organization role topology; (ii) Smallblue data & social + information network topology

[Figure: plot against Fault Rate (x-axis, 0 to 0.1; y-axis ticks 0 to 14); series: Basic, Opt 1, Opt 1, 2]

Summary

• Network patterns encode spatial-temporal properties of various networks
  – Ability to scalably mine and match network patterns is key for understanding global network phenomena
• Case study on fault diagnosis and prediction in communication networks
  – Complexity of solution has to be linear in network size
  – Topologically indexed databases were a key tool for addressing scalability
• Explore more complex network patterns for information, social and biological networks which exhibit stronger coupling relationships
  – A failed router does not cause its neighboring router to fail
  – A corrupt information node can corrupt its neighbor (e.g., summary node)
  – A diseased enzyme can catalyze/inhibit its neighbors

Questions?

Mudhakar Srivatsa

[email protected]
