Detecting Large-Scale System Problems by Mining Console Logs
Authors: Wei Xu*, Ling Huang†, Armando Fox*, David Patterson*, Michael Jordan*
Conferences: ICML 2010, ACM SOSP 2009
Advisor: Yuh-Jye Lee
Reporter: Yi-Hsiang Yang
Outline
• Introduction
• Methodology
• Evaluation and Visualization
• Conclusion
Introduction
• What information do console logs provide?
Console logs rarely help operators detect problems in
large-scale datacenter services
Operational problems depend on the
deployment and runtime environment
A typical console log is much more structured than it appears
• Anomaly detection
Unusual log messages often indicate the
source of the problem
Workflow
• Log Parsing
Convert a log message from unstructured text
to a data structure
• Feature creation
Construct the state ratio vector and the
message count vector features
• Anomaly detection
Principal Component Analysis (PCA)-based
anomaly detection method
• Visualization
Decision tree
Log Parsing with Source Code
• Difficulty: recovering message templates automatically
C language
fprintf(LOG, "starting: xact %d is %s")
Java
CLog.info("starting: " + txn)
• Not easy to distinguish variables and states
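Once a template is known, parsing a message reduces to pattern matching. A minimal sketch, assuming a printf-style template that uses only %d and %s; the helper name template_to_regex and the sample message are hypothetical and not taken from the original system.

```python
import re

def template_to_regex(fmt):
    """Turn a printf-style format string into a regex with one group per specifier."""
    pattern = ""
    # re.split with a capturing group keeps the specifiers in the result list.
    for part in re.split(r"(%d|%s)", fmt):
        if part == "%d":
            pattern += r"(-?\d+)"      # integer variable
        elif part == "%s":
            pattern += r"(\S+)"        # single-token string variable (assumption)
        else:
            pattern += re.escape(part)  # constant text of the template
    return re.compile("^" + pattern + "$")

template = template_to_regex("starting: xact %d is %s")

# Hypothetical log line produced by the template above.
match = template.match("starting: xact 325 is COMMITTING")
fields = match.groups() if match else None
```

Here fields holds the extracted variables as strings. Without the template, nothing in the raw text says which tokens are constant and which are variables.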
Parsing Approach
-Source Code
• Generate the source code’s abstract
syntax tree (AST)
• Use the AST to identify all method calls on
objects of the logger classes (or their subclasses)
• Deduce the types of variables in message
templates
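The AST step can be illustrated with Python's standard ast module. This is only an analogy under stated assumptions: the slides do not show the authors' implementation, the source snippet and the set of logging method names are hypothetical, and this sketch matches calls by method name only, without checking the receiver's class or deducing variable types.

```python
import ast

# Hypothetical source file to analyze; the logger object and its methods are assumptions.
SOURCE = '''
def begin(log, txn):
    log.info("starting: xact %d is %s", txn.tid, txn.state)
    log.debug("cache size %d", txn.cache)
    print("not a log call")
'''

LOG_METHODS = {"debug", "info", "warning", "error"}

def describe(arg):
    """Return a short name for a call argument (attribute or plain variable)."""
    if isinstance(arg, ast.Attribute):
        return arg.attr
    if isinstance(arg, ast.Name):
        return arg.id
    return "<expr>"

def extract_templates(source):
    """Collect (template, argument names) for every logger-style method call."""
    found = []
    for node in ast.walk(ast.parse(source)):
        # Keep only calls of the form <object>.<method>(...).
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr not in LOG_METHODS or not node.args:
            continue
        first = node.args[0]
        # The first argument must be a string literal to serve as a template.
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            found.append((first.value, [describe(a) for a in node.args[1:]]))
    return found

templates = extract_templates(SOURCE)
```

Each entry pairs a message template with the names of the variables that fill it; the plain print call is skipped because it is not a method call on an object.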
Parsing Approach
-Log
• Index the message templates with an Apache Lucene reverse index
• Implemented as a Hadoop MapReduce
job
Replicate the index to every node and
partition the log among the nodes
The map stage performs the reverse-index
search
The reduce stage processing depends on the
features to be constructed
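The map and reduce stages above can be sketched in memory. This is a rough illustration that uses neither Lucene nor Hadoop: a plain dictionary stands in for the reverse index, and the templates, log lines, and function names are all hypothetical. The reduce step shown (counting message types per identifier) is just one possible choice.

```python
import re
from collections import Counter, defaultdict

# Hypothetical templates; "(.*)" marks a variable slot.
TEMPLATES = {
    "T1": "starting: xact (.*) is (.*)",
    "T2": "receiving block (.*) from (.*)",
}

def build_index(templates):
    """Reverse index: constant token -> ids of the templates that contain it."""
    index = defaultdict(set)
    for tid, text in templates.items():
        for token in text.split():
            if token != "(.*)":
                index[token].add(tid)
    return index

def template_regex(text):
    """Regex that matches the constant text and captures each variable slot."""
    parts = text.split("(.*)")
    return re.compile("^" + r"(\S+)".join(re.escape(p) for p in parts) + "$")

def map_message(index, templates, message):
    """Map stage: rank candidate templates by shared tokens, then verify by regex."""
    scores = Counter()
    for token in message.split():
        for tid in index.get(token, ()):
            scores[tid] += 1
    for tid, _ in scores.most_common():
        match = template_regex(templates[tid]).match(message)
        if match:
            return tid, match.groups()
    return None, ()  # parse failure: no template matches

def reduce_counts(parsed):
    """Reduce stage (one possible choice): count message types per identifier."""
    counts = defaultdict(Counter)
    for tid, variables in parsed:
        if tid is not None:
            counts[variables[0]][tid] += 1
    return counts

INDEX = build_index(TEMPLATES)
LOG = [
    "receiving block blk_1 from 10.0.0.5",
    "receiving block blk_1 from 10.0.0.6",
    "starting: xact 7 is ACTIVE",
    "unparseable line",
]
parsed = [map_message(INDEX, TEMPLATES, line) for line in LOG]
counts = reduce_counts(parsed)
```

Because the index is small and read-only, every worker can hold its own copy and process its partition of the log independently, which is what makes the map stage easy to parallelize.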
Feature Creation
• The state ratio vector
Each state ratio vector: a group of state variables in
a time window
• The message count vector
Each vector dimension: a different message type
Value of the dimension: the number of messages of that type
that appear in the message group
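Both feature types reduce to counting. A minimal sketch, assuming the log has already been parsed into simple tuples; the event lists, state names, message type names, and the 10-second window are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical parsed state observations: (timestamp_sec, state value).
STATE_EVENTS = [
    (4, "COMMITTING"), (5, "COMMITTING"), (7, "ACTIVE"),
    (12, "ABORTING"), (13, "COMMITTING"),
]

# Hypothetical parsed messages: (identifier, message type).
MESSAGE_EVENTS = [
    ("blk_1", "receiving"), ("blk_1", "received"),
    ("blk_1", "receiving"), ("blk_2", "receiving"),
]

def state_ratio_vectors(events, states, window_sec):
    """One vector per time window; each dimension counts one state value."""
    windows = defaultdict(Counter)
    for timestamp, state in events:
        windows[timestamp // window_sec][state] += 1
    return {w: [c[s] for s in states] for w, c in sorted(windows.items())}

def message_count_vectors(events, message_types):
    """One vector per identifier; each dimension counts one message type."""
    groups = defaultdict(Counter)
    for identifier, message_type in events:
        groups[identifier][message_type] += 1
    return {i: [c[t] for t in message_types] for i, c in groups.items()}

state_vectors = state_ratio_vectors(
    STATE_EVENTS, ["ACTIVE", "COMMITTING", "ABORTING"], window_sec=10)
count_vectors = message_count_vectors(MESSAGE_EVENTS, ["receiving", "received"])
```

Stacking the vectors row by row gives the matrix that the anomaly detection step consumes.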
Anomaly Detection
-Principal Component Analysis (PCA)
• Apply Term Frequency / Inverse Document
Frequency (TF-IDF) weighting
• Replace each entry y_{i,j} with a weighted entry
w_{i,j} ≡ y_{i,j} log(n / df_j), where n is the total number of
message groups and df_j is the number of message groups
that contain the j-th message type
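The weighting and the PCA step can be sketched together in plain Python. This is a simplified illustration under loud assumptions: it keeps only one principal direction (found by power iteration), scores each message group by its squared distance from that direction, and does not compute a detection threshold. The toy matrix Y and all function names are hypothetical.

```python
import math

def tfidf(Y):
    """Weight each count y[i][j] by log(n / df_j)."""
    n, m = len(Y), len(Y[0])
    df = [sum(1 for row in Y if row[j] > 0) for j in range(m)]
    return [[row[j] * math.log(n / df[j]) if df[j] else 0.0 for j in range(m)]
            for row in Y]

def top_component(X, iterations=200):
    """Power iteration for the first principal direction of centered rows X."""
    m = len(X[0])
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        proj = [sum(a * b for a, b in zip(x, v)) for x in X]
        w = [sum(p * x[j] for p, x in zip(proj, X)) for j in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v

def spe_scores(Y):
    """Squared prediction error of each row outside the top principal direction."""
    W = tfidf(Y)
    n, m = len(W), len(W[0])
    mean = [sum(row[j] for row in W) / n for j in range(m)]
    X = [[row[j] - mean[j] for j in range(m)] for row in W]
    v = top_component(X)
    scores = []
    for x in X:
        p = sum(a * b for a, b in zip(x, v))
        # Residual after removing the part explained by the normal pattern.
        scores.append(sum((a - p * b) ** 2 for a, b in zip(x, v)))
    return scores

# Toy message count matrix: rows are message groups, columns are message types.
Y = [
    [0, 0, 0, 1],
    [2, 2, 0, 1],
    [4, 4, 0, 1],
    [6, 6, 0, 1],
    [8, 8, 0, 1],
    [10, 10, 0, 1],
    [4, 0, 1, 1],   # breaks the usual pattern between the first two types
]
scores = spe_scores(Y)
most_anomalous = max(range(len(Y)), key=scores.__getitem__)
```

In this toy data the last message type appears in every group, so its weight is log(1) = 0 and it no longer influences the result. The last row, which breaks the correlation between the first two types, receives the largest score.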
Evaluation and Visualization
• Experiments run on Amazon Elastic Compute Cloud (EC2)
• 203 nodes running HDFS and 1 node running Darkstar
Evaluation and Visualization
• Parsing fails when the parser cannot find a message template that
matches the message and extract the message variables.
Evaluation and Visualization
• Parsing takes less than 3 minutes with 50 nodes and less than
10 minutes with 10 nodes
Evaluation and Visualization
-Darkstar
• DarkMud
Provided by the Darkstar team
Emulated 60 user clients in the DarkMud
virtual world performing random operations
Ran the experiment for 4800 seconds
Injected a performance disturbance by
capping the CPU from 1400 to 1800 seconds
Disturbance by capping the CPU
Evaluation and Visualization
-Darkstar
• During the disturbance, the ratio of ABORTING to
COMMITTING state counts increases from about
1:2000 to about 1:2
• Darkstar does not adjust its transaction
timeout accordingly
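A ratio like this one can be tracked per time window. A minimal sketch with synthetic data chosen to mimic the two ratios quoted above; the event list, the 1000-second window, and the function name are hypothetical and are not measurements from the experiment.

```python
from collections import Counter, defaultdict

def abort_commit_ratio(events, window_sec):
    """Return {window index: ABORTING count / COMMITTING count, or None if no commits}."""
    windows = defaultdict(Counter)
    for timestamp, state in events:
        windows[int(timestamp // window_sec)][state] += 1
    ratios = {}
    for window, counts in sorted(windows.items()):
        commits = counts["COMMITTING"]
        ratios[window] = counts["ABORTING"] / commits if commits else None
    return ratios

# Synthetic events: (timestamp_sec, state). Window 0 is calm, window 1 is disturbed.
events = (
    [(10, "COMMITTING")] * 2000 + [(20, "ABORTING")]
    + [(1410, "COMMITTING")] * 2 + [(1420, "ABORTING")]
)
ratios = abort_commit_ratio(events, window_sec=1000)
```

A jump of this size between adjacent windows is the kind of change the state ratio vector is meant to expose.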
Evaluation and Visualization
-Darkstar
• Augmented each feature vector using the
timestamp of the last message in that group
Evaluation and Visualization
-Hadoop
Conclusion
• Using source code as a reference to understand
the structure of console logs makes it possible to parse
logs accurately
• New opportunities for turning built-in console
logs into a powerful monitoring system for
problem detection
Thanks for your attention
Q&A