Lightweight Task Graph Inference for Distributed Applications

Bin Xin, Patrick Eugster, Xiangyu Zhang
Dept. of Computer Science
Purdue University
{xinb, peugster, xyzhang}@cs.purdue.edu
Jinlin Yang
Center for Software Excellence
Microsoft Corp.
[email protected]
2010 29th IEEE International Symposium on Reliable Distributed Systems
Introduction
• New challenges to reliability as applications move to the cloud
  • Distinct corporate entities manage the infrastructure and own the deployed application
  • Application developers do not have access to lower-level debugging information in case of failures/faults
    • They depend on application output or application-level custom logs for diagnosis
• Goal: describe the high-level structural view of a distributed program execution to facilitate easy “after the fact” diagnosis
Contributions
• Define an abstraction for representing distributed executions: “Tasks”
• A lightweight approach to generate “Task Graphs” from the application event logs
• A declarative formulation of the rules to generate Task Graphs using Prolog
• Demonstrate the use of Task Graphs to help understand the distributed execution, including anomaly detection
Relevance to SmartGrid and CiC
• Extensions
  • Fault detection by real-time log processing (CEP?)
    • The patterns for CEP can be defined by the application developer
    • OR can be auto-generated using code augmentation and static code analysis
  • On fault detection, the task graph can be used to decide “recovery” mechanisms (other than a naïve restart-process strategy)
• Shortcomings
  • Does not explicitly consider the “Data Repository”
    • Considered only as one of the ‘tasks’
  • Not sure how it handles transactions
Definitions
Event: the execution of an operation that sends data or a signal to (or receives it from) a different thread/process. Events are the smallest building blocks.
Signaling Event: the operation of sending
Acting Event: the operation of receiving
Happens Before (a → b): a partial ordering of events, where a is the sender's event and b is the event of the receiver that acts on that signal.
Task: autonomous computation within a thread between two “acting” events, [Astart, Aend)
• A task contains exactly one acting event
• Zero or more signaling events
Task Graph: a DAG whose nodes are tasks and whose edges represent Happens Before relations
A Request: a pair of signaling and acting events where the signaling event originates from outside the system.
A Reply: a pair of signaling and acting events where the acting event is triggered outside the system.
E2E service Graph:
Example
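As a rough illustration of the definitions above (not the example from the original slide), the sketch below splits one thread's events into tasks. The class names, field names, and the "acting"/"signaling" labels are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Event:
    event_id: int
    thread: str
    kind: str  # assumed labels: "acting" (receive) or "signaling" (send)


@dataclass
class Task:
    thread: str
    acting: Event  # the single acting event that starts the task
    signaling: List[Event] = field(default_factory=list)


def split_into_tasks(thread_events):
    """Split one thread's events (in program order) into tasks.

    A task starts at an acting event and runs up to, but not including,
    the next acting event of the same thread: [A_start, A_end).
    """
    tasks = []
    for ev in thread_events:
        if ev.kind == "acting":
            tasks.append(Task(thread=ev.thread, acting=ev))
        elif tasks:
            tasks[-1].signaling.append(ev)
        # Signaling events seen before any acting event are skipped here.
    return tasks


# Hypothetical event sequence for a single thread T1.
events = [
    Event(1, "T1", "acting"),
    Event(2, "T1", "signaling"),
    Event(3, "T1", "signaling"),
    Event(4, "T1", "acting"),
    Event(5, "T1", "signaling"),
]
tasks = split_into_tasks(events)
```

Each resulting task holds exactly one acting event and zero or more signaling events, matching the definition of a task.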
System Setup
• Uses HDFS as the example application on the cloud
• HDFS logs are not sufficient/standardized
• Uses instrumentation with a tool called “AspectJ”
  • AspectJ lets the developer insert code based on specific “rules” during compilation
  • Each event is logged as a 7-field tuple
    • (EventID, ProcID, threadID, SourceLocation, Type, Tag, Value)
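The 7-field tuple could be read back from a log roughly as follows. The slides do not show the on-disk format, so the parenthesized, comma-separated layout and the sample line are assumptions made only for illustration.

```python
from typing import NamedTuple


class LoggedEvent(NamedTuple):
    event_id: str
    proc_id: str
    thread_id: str
    source_location: str
    event_type: str
    tag: str
    value: str


def parse_event(line: str) -> LoggedEvent:
    """Parse one log line into the 7-field tuple (assumed text layout)."""
    fields = [f.strip() for f in line.strip().strip("()").split(",")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    return LoggedEvent(*fields)


# Hypothetical log line; real instrumented output may look different.
sample = "(17, 4021, 12, Example.java:42, signaling, socket-9, 512)"
ev = parse_event(sample)
```

All fields are kept as strings here; a real parser would convert them once the actual field types are known.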
Constructing Task Graphs (Prolog formulation) I
Events
• A “Fact” to parse and store all events
• An entry for hb is made only if the rules on the right are true for events X and Y
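The Prolog rules themselves did not survive in this transcript. As a stand-in, here is a minimal Python sketch of one plausible rule: hb(X, Y) holds when X is a signaling event, Y is an acting event, both carry the same tag, and they belong to different threads. The rule and the sample events are assumptions, not the paper's formulation.

```python
from itertools import product

# (event_id, thread_id, type, tag); hypothetical sample events.
events = [
    (1, "T1", "signaling", "lock-A"),
    (2, "T2", "acting", "lock-A"),
    (3, "T2", "signaling", "sock-7"),
    (4, "T3", "acting", "sock-7"),
    (5, "T3", "acting", "lock-B"),
]


def hb_pairs(events):
    """Return (x_id, y_id) pairs for which the assumed hb rule holds."""
    pairs = []
    for x, y in product(events, repeat=2):
        x_id, x_thread, x_type, x_tag = x
        y_id, y_thread, y_type, y_tag = y
        if (x_type == "signaling" and y_type == "acting"
                and x_tag == y_tag and x_thread != y_thread):
            pairs.append((x_id, y_id))
    return pairs


hb = hb_pairs(events)
```

Because the rule only compares tags, it pairs every matching signal with every matching receive, which is where the false positives discussed in the following slides come from.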
Constructing Task Graphs (Prolog formulation) II
Tasks
Issues & Solutions - I
Problem:
False positives caused by common sync objects.
A notion of “time” is required, but global clocks or vector clocks are expensive and complex.
Proposed Solution:
Heuristic: use the order of events in the event logs.
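One way to read this heuristic, sketched under assumptions: if the events of a process are appended to one log in the order they occur, an acting event can be linked to the closest preceding signaling event on the same sync object. The "closest preceding" choice and the sample log are assumptions of this sketch.

```python
def hb_by_log_order(events):
    """Pair each acting event with the latest earlier signal on the same tag."""
    last_signal = {}  # tag -> (event_id, thread) of the latest signaling event
    pairs = []
    for event_id, thread, kind, tag in events:  # events are in log order
        if kind == "signaling":
            last_signal[tag] = (event_id, thread)
        elif kind == "acting" and tag in last_signal:
            sig_id, sig_thread = last_signal[tag]
            if sig_thread != thread:
                pairs.append((sig_id, event_id))
    return pairs


# Hypothetical log: the same sync object "lock-A" is reused twice.
log = [
    (1, "T1", "signaling", "lock-A"),
    (2, "T2", "acting", "lock-A"),
    (3, "T3", "signaling", "lock-A"),
    (4, "T4", "acting", "lock-A"),
]
pairs = hb_by_log_order(log)
```

Matching on the tag alone would also report (1, 4) and (3, 2); using log order keeps only (1, 2) and (3, 4).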
Issues & Solutions - II
Problem:
False positives caused by communication: multiple writes on the same socket.
Proposed Solution:
Heuristic: use the “packet size” and the total bytes received so far to decide which write to associate with which reads.
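A minimal sketch of the byte-counting idea, assuming each logged write and read on a socket records its size: the running totals give every write and read a byte range in the stream, and a read is associated with the writes whose ranges it overlaps. Function and variable names are hypothetical.

```python
def match_writes_to_reads(write_sizes, read_sizes):
    """Map each read index to the indices of the writes whose bytes it consumes."""
    # Byte range [start, end) covered by each write in the stream.
    write_spans = []
    offset = 0
    for size in write_sizes:
        write_spans.append((offset, offset + size))
        offset += size

    matches = {}
    received = 0  # total bytes received so far
    for r, size in enumerate(read_sizes):
        r_start, r_end = received, received + size
        matches[r] = [
            w for w, (w_start, w_end) in enumerate(write_spans)
            if w_start < r_end and r_start < w_end  # ranges overlap
        ]
        received = r_end
    return matches


# Two writes (100 and 50 bytes) consumed by three reads (60, 40, 50 bytes).
matches = match_writes_to_reads([100, 50], [60, 40, 50])
```

Here the first two reads map to the first write and the third read maps to the second write, so no spurious edge is drawn from the second write to the early reads.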
Issues & Solutions - III
Problem:
False negatives caused by guarded waits.
Multiple waiting threads are notified, and the lock condition may be updated before the current thread executes. Hence a condition check is required after waking up.
Proposed Solution:
Handle such cases manually: remove the augmented code within the loop and add a marker just after the loop.
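The guarded-wait pattern and the marker placement can be sketched as follows. The paper instruments Java code with AspectJ; this Python version using `threading.Condition` is only an analogy, and the list named `log` is a hypothetical stand-in for the event log.

```python
import threading

queue = []
cond = threading.Condition()
log = []  # stand-in for the event log


def consumer(name):
    with cond:
        # Guarded wait: re-check the condition after every wake-up, because
        # another notified thread may already have taken the item.
        while not queue:
            cond.wait()  # nothing is logged inside the loop
        # Marker just after the loop: the wait has really succeeded here.
        log.append(("acting", name))
        queue.pop(0)


def producer(item):
    with cond:
        queue.append(item)
        log.append(("signaling", "producer"))
        cond.notify_all()  # wakes every waiting consumer


threads = [threading.Thread(target=consumer, args=(f"c{i}",)) for i in range(2)]
for t in threads:
    t.start()
producer("x")
producer("y")
for t in threads:
    t.join()
```

Logging once after the loop records one acting event per successful wait, regardless of how many times a thread woke up and found the condition false.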
Evaluation - I
Performance Impact
Runtime:
22.2% increase in binary size
38% increase in execution time
TaskGraph building using Prolog:
Evaluation – II (Demo)
To help a new HDFS developer analyze HDFS execution