Detecting, Managing, and Diagnosing Failures with FUSE John Dunagan, Juhan Lee (MSN), Alec Wolman WIP.
Goals & Target Environment
Improve the ability of large internet
portals to gain insight into failures
Non-goals:
masking failures
using machine learning to infer
abnormal behavior
MSN Background
Messenger, www.msn.com, Hotmail, Search,
many other “properties”
Large (> 100 million users)
Sources of Complexity:
multiple data-centers
large # of machines
complex internal network topology
diversity of applications and software
infrastructure
The Plan
Detecting, managing, and diagnosing
failures
Review MSN’s current approaches
Describe our solution at a high level
Detecting Failures
Monitor system availability with heartbeats
Monitor application availability & quality of service
using synthetic requests
Customer complaints
Telephone, email
Problems:
These approaches provide limited coverage – harder to
catch failures that don’t affect every request
Data on detected failures often lacks necessary detail to
suggest a remedy:
which front end is flaky?
which app component caused end-user failure?
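As a rough illustration of the heartbeat approach listed above, the sketch below marks a machine as down when no heartbeat has arrived within a timeout. The class name HeartbeatMonitor, its methods, and the 3-second timeout are assumptions made for this example; they do not describe MSN's monitoring system.

```python
class HeartbeatMonitor:
    """Hypothetical heartbeat tracker: a machine counts as down if silent too long."""

    def __init__(self, timeout_s=3.0):
        self.timeout_s = timeout_s
        self.last_seen = {}  # machine name -> time of its last heartbeat

    def heartbeat(self, machine, now):
        # Record that `machine` reported in at time `now` (seconds).
        self.last_seen[machine] = now

    def down_machines(self, now):
        # A machine is reported down when its last heartbeat is older than the timeout.
        return sorted(m for m, t in self.last_seen.items()
                      if now - t > self.timeout_s)


mon = HeartbeatMonitor(timeout_s=3.0)
mon.heartbeat("fe1", now=0.0)
mon.heartbeat("fe2", now=0.0)
mon.heartbeat("fe1", now=4.0)
print(mon.down_machines(now=5.0))
```

A check like this only notices a machine that goes silent. A front end that keeps sending heartbeats while failing a fraction of its requests passes unnoticed, which is the coverage gap described above.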
Managing Failures
Definition:
When server “x” fails, what is the impact of
this failure?
Ability to prioritize failures
Detect component service degradation
Characterizing app-stability
Capacity planning
Better use of ops and engineering resources
Current approach: no systematic attempt to
provide this functionality
Our solution (in 2 steps)
Detecting and Managing Failures
Step 1: Instrument applications to track
user requests across the “service chain”
Each request is tagged with a unique id
Service chain is composed on-the-fly with
help of app instrumentation
For each request:
Collect per-hop performance information
Collect per-request failure status
Centralized data collection
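The tagging and per-hop collection in Step 1 could look roughly like the sketch below. All names here (HopRecord, RequestTrace, Collector, and the component labels) are hypothetical and chosen for illustration; the slides do not specify the actual data model.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class HopRecord:
    component: str     # which machine or app component handled this hop
    elapsed_ms: float  # per-hop performance information
    ok: bool           # whether this hop succeeded


@dataclass
class RequestTrace:
    # Each request is tagged with a unique id when it enters the system.
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    hops: list = field(default_factory=list)

    def record_hop(self, component, elapsed_ms, ok=True):
        # The service chain is built up hop by hop as the request flows.
        self.hops.append(HopRecord(component, elapsed_ms, ok))

    @property
    def failed(self):
        return any(not h.ok for h in self.hops)

    def first_failure(self):
        for h in self.hops:
            if not h.ok:
                return h.component
        return None


class Collector:
    """Stand-in for the centralized data collection point."""

    def __init__(self):
        self.traces = {}

    def submit(self, trace):
        self.traces[trace.request_id] = trace

    def failures_by_component(self):
        # Count failed requests by the first component that reported a failure.
        counts = {}
        for t in self.traces.values():
            c = t.first_failure()
            if c is not None:
                counts[c] = counts.get(c, 0) + 1
        return counts
```

With records of this shape, a question such as "which front end is flaky?" becomes a simple aggregation over the collected traces.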
What kinds of failures?
We can handle:
Machine failures
Network connectivity problems
We can handle most:
Misconfiguration
Application bugs
But not all:
Application errors where the app itself
doesn’t detect that there is a problem
Diagnosing Failures
Assigning responsibility to a specific hw or
sw component
Insight into internals of a component
Cross component interactions
Current approach: instrument applications
App-specific log messages
Problems
High request rates => log rollover
Perceived overhead => detailed logging enabled
during testing, disabled in production
FUSE Background
FUSE (OSDI 2004): lightweight
agreement on only one thing: whether
or not a failure has occurred
Lack of a positive ack => failure
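The rule "lack of a positive ack => failure" can be pictured with the toy model below. It is a single-process sketch, not the distributed FUSE protocol: the class FuseGroup, its methods, and the 1-second ack timeout are all assumptions for illustration. The one property it tries to capture is that the group agrees on a single bit, and once that bit says "failed" it stays that way.

```python
class FuseGroup:
    """Toy model of a failure-agreement group (hypothetical, single process)."""

    def __init__(self, members, ack_timeout_s=1.0):
        self.ack_timeout_s = ack_timeout_s
        self.last_ack = {m: 0.0 for m in members}  # member -> time of last ack
        self.failed = False

    def ack(self, member, now):
        # Positive acks only matter while the group is still alive.
        if not self.failed:
            self.last_ack[member] = now

    def signal_failure(self):
        # Any member may declare the failure explicitly.
        self.failed = True

    def check(self, now):
        # A missing ack from any member is treated as failure of the whole group.
        if any(now - t > self.ack_timeout_s for t in self.last_ack.values()):
            self.failed = True
        return self.failed
```

Because the failure state is never reset, a late ack from the silent member does not revive the group; every participant ends up with the same answer.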
Step 2: Conditional Logging
Step 2: Implement “conditional logging” to
significantly reduce the overhead of collecting
detailed logs across different machines in the
service chain
Step 1 provides the ability to identify a request across all
participants in the service chain; FUSE provides agreement
on failure status across that chain
While fate is undecided: detailed log messages are stored in
main memory
Common-case overhead of logging is vastly reduced
Once the fate of the service chain is decided, we discard app
logs for successful requests and save logs for failures
Quantity of data generated is manageable when most
requests are successful
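A minimal sketch of the buffering behavior described above, assuming a per-request in-memory buffer keyed by request id. The class ConditionalLogger and its method names are hypothetical, and the `saved` dictionary merely stands in for whatever persistent store a real deployment would use.

```python
class ConditionalLogger:
    """Hypothetical conditional logger: keep detailed logs only for failed requests."""

    def __init__(self):
        self.pending = {}  # request id -> messages held in main memory
        self.saved = {}    # request id -> messages kept after a failure

    def log(self, request_id, message):
        # While the request's fate is undecided, messages only go to memory.
        self.pending.setdefault(request_id, []).append(message)

    def resolve(self, request_id, failed):
        # Once the fate is known: save logs for failures, discard the rest.
        messages = self.pending.pop(request_id, [])
        if failed:
            self.saved[request_id] = messages


log = ConditionalLogger()
log.log("r1", "front end: parsed request")
log.log("r2", "front end: parsed request")
log.log("r2", "back end: timeout")
log.resolve("r1", failed=False)  # success: buffered messages are dropped
log.resolve("r2", failed=True)   # failure: buffered messages are kept
print(log.saved)
```

In the successful case the only cost is appending to a list and later dropping it, which is why the amount of saved data stays small when most requests succeed.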
Example
[Diagram: a service chain of Client, Server1, Server2, and Server3, with an X marking a failure]
Benefits:
FUSE allows monitoring of real transactions.
When a request fails, FUSE provides an audit trail
All transactions, or a sampled subset to control
overhead.
How far did it get?
How long did each step take?
Any additional application specific context.
FUSE can be deployed incrementally.
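One way to monitor "a sampled subset" is to derive the sampling decision from the request id, so that every machine in the service chain reaches the same decision without coordinating. This is an assumption about how sampling could be done, not something the slides state; the function name is_sampled and the use of SHA-256 are illustrative choices.

```python
import hashlib


def is_sampled(request_id, sample_rate):
    """Return True if this request falls in the sampled subset.

    The decision depends only on the request id, so each hop that sees
    the same id makes the same choice. `sample_rate` is in [0.0, 1.0].
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    threshold = int(sample_rate * 2**64)       # integer compare avoids rounding at 1.0
    return value < threshold
```

A rate of 1.0 monitors all transactions and a rate of 0.0 monitors none; values in between trade coverage for overhead.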
Issues
Overload policy: need to handle bursts
of failures without inducing more
failures
How much effort to make apps FUSE
enabled?
Are the right components FUSE
enabled?
Identifying and filtering false positives
Tracking request flow is non-trivial with
network load balancers
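For the overload issue, one possible policy is to cap how many failure logs are saved per time window, so that a burst of failures cannot turn logging itself into a source of load. The slides leave the policy open; the fixed-window limiter below, including the name FailureLogLimiter and its parameters, is only a sketch of one option.

```python
class FailureLogLimiter:
    """Hypothetical fixed-window cap on failure logs saved per window."""

    def __init__(self, max_per_window, window_s=1.0):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.window_start = 0.0
        self.saved_in_window = 0
        self.dropped = 0  # failures whose detailed logs were not saved

    def allow(self, now):
        # Start a fresh window once the current one has elapsed.
        if now - self.window_start >= self.window_s:
            self.window_start = now
            self.saved_in_window = 0
        if self.saved_in_window < self.max_per_window:
            self.saved_in_window += 1
            return True
        # Over the cap: count the failure but skip saving its detailed log.
        self.dropped += 1
        return False
```

Keeping a count of dropped logs preserves the fact that a burst happened even when the detailed audit trails for some of the failures are not saved.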
Status
We’ve implemented FUSE for MSN,
integrated with the ASP.NET rendering
engine
Testing in progress
Roll-out at end of summer
Backups
FUSE is Easy to Integrate
Example current code on Front End:
ReceiveRequestFromClient(…) {
…
SendRequestToBackEnd(…);
}
Example code on Front End using FUSE:
ReceiveRequestFromClient(…, FUSEinfo f) { // default value of f = null
if ( f != null ) JoinFUSEGroup( f );
…
SendRequestToBackEnd(…, f );
}
Current implementation is in C# and consists of 2400 LOC