Magpie: Distributed request tracking for realistic performance modelling Rebecca Isaacs Paul Barham Richard Mortier Dushyanth Narayanan Microsoft Research Cambridge James Bulpin University of Cambridge 12 November 2003
Performance in distributed systems
Faults in distributed systems are notoriously hard to diagnose
Performance problems are even more subtle to debug
Often transient or affect only a subset of requests / users
Frequently involve complex interactions between multiple machines
Aggregate statistics (e.g. utilization) may look perfectly normal
Magpie Approach
Track individual requests end to end
Observe control flow (causality)
Monitor resource consumption: CPU, bandwidth, disk
Debug performance “in the small”
Build a probabilistic workload model from the aggregate requests
Cluster similar requests according to their observed behaviour
Debug performance “in the large”
How do we use this information?
Performance debugging
Why did this request take much longer than that request?
Fault detection
Configuration and management
Performance prediction
Realistic workload models for capacity planning
Obtain automatically on a “live” system
Magpie components
Instrumentation: system activity recorded to logs
Generic request parser: extract individual requests from logs according to an event schema
Model construction: behavioural clusters, probabilistic state machine
Outline
Introduction
What is a request?
Instrumentation
Request extraction
Modelling
Current status
What is a request?
System activity which takes place in response to an action initiated by the application being traced
HTTP request
Database query
File open request
We describe a request as
The sequence of application components involved in its processing
The resources consumed at each stage
CPU, bandwidth, disk transfer size, (latency)
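To make this description concrete, here is a minimal Python sketch of a request as an ordered list of stages, each carrying the resources it consumed. The class and field names (Stage, Request, cpu_ms and so on) are assumptions made for illustration; the slides do not show Magpie's actual data structures, and the numbers are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """One step of a request: the component involved and what it consumed."""
    component: str            # e.g. "IIS", "ASP.NET", "SQL Server"
    cpu_ms: float = 0.0       # CPU time charged to this stage
    net_bytes: int = 0        # network bytes sent or received
    disk_bytes: int = 0       # disk transfer size
    latency_ms: float = 0.0   # elapsed time, if recorded


@dataclass
class Request:
    kind: str                                   # e.g. "HTTP request"
    stages: list = field(default_factory=list)  # Stage objects in processing order

    def components(self):
        """The sequence of components involved in processing."""
        return [s.component for s in self.stages]

    def total_cpu_ms(self):
        """CPU time summed over all stages."""
        return sum(s.cpu_ms for s in self.stages)


# Illustrative values only; they are not measurements from the talk.
req = Request("HTTP request", [
    Stage("IIS", cpu_ms=2.0, net_bytes=1024),
    Stage("ASP.NET", cpu_ms=6.0),
    Stage("SQL Server", cpu_ms=3.0, disk_bytes=192 * 1024),
])
print(req.components(), req.total_cpu_ms())
```

Keeping resources per stage, rather than only per request, is what lets two requests with the same total cost but different control flow be told apart later.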
A typical e-commerce site (1)
[Figure: site architecture showing the Internet, Web Front Ends, SQL Servers and Storage]
A typical e-commerce site (2)
[Figure: software components of the Web Server and SQL Server machines. Labels: http.sys, IIS, Filter, Static Content, ASP.NET, CLR, Application Logic, ADO.NET, Stored procedures, Data, WinSock2 API, Kernel]
HTTP request: detailed view
[Figure: timeline of one HTTP request across threads WEB.eec, WEB.398 and SQL.9c4, with Net RX, Net TX and Disk rows and a time axis marked at 10.051s, 10.100s and 10.155s. Key: Blocked, IIS, ASP.NET, SQL, Disk, Other]
Annotations in the figure:
HTTP request packet
IIS worker thread picks up request from http.sys
ASP.NET worker thread takes over
Sync WinSock send to SQL Server
ASP.NET thread blocks after RPC to database
TDS request and reply packets sent and received
SQL thread unblocks
IIS worker thread wakes up to write log
HTTP response packets sent back to client
Why is request tracking hard?
Many components, multiple machines
No globally unique request ID
Many threads participate in processing a request
Asynchronous communication
Components are developed independently
Multiple thread pools
Must track control flow across machines
Must match send/recvs between threads/machines
Hand-rolled synchronization primitives
SQL server has user-mode scheduler
Event Tracing for Windows
Low-overhead event mechanism
Events timestamped with cycle counter
Global ordering on events on a single machine
Can enable/disable sets of events at runtime
Using ETW in Magpie
Each instrumentation point posts an event
Events are logged to disk
Logs are post-processed to extract requests
Can also consume events in real time
Instrumentation points
Existing ETW event providers
App-specific hooks
IIS, ASP.NET, SQL Server
Detours
IIS, kernel
Wrap dlls to trap Win32 and WinSock2 calls
WinPcap
Capture packets on the wire
CPU usage from kernel events
The ETW kernel logger records every context switch
How do we know which cycles are used for which request?
We can attribute cycles to a request by
An application-specific event which occurs within a delimited sector of CPU time, or
The current context of execution, e.g. thread id
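The second option, attribution by execution context, can be sketched as follows. This is a simplified Python illustration, not Magpie's implementation: it assumes the context-switch log for one CPU is already a sorted list of (cycle count, incoming thread id) pairs, and that a hypothetical thread_to_request mapping is known and fixed for the duration of the log.

```python
from collections import defaultdict


def attribute_cycles(cswitches, thread_to_request):
    """Charge the cycles between consecutive context switches to the
    request owned by the thread that was running in that interval.

    cswitches: list of (cycle_count, incoming_thread_id), sorted by cycle_count.
    thread_to_request: dict thread_id -> request name (assumed fixed here).
    """
    usage = defaultdict(int)
    for (start, tid), (end, _next_tid) in zip(cswitches, cswitches[1:]):
        request = thread_to_request.get(tid)
        if request is not None:          # idle or unrelated threads are skipped
            usage[request] += end - start
    return dict(usage)


# Illustrative log: thread 0xa runs, then 0xb, then 0xa again, then idle.
cswitches = [(100, 0xA), (250, 0xB), (400, 0xA), (450, 0x0)]
owners = {0xA: "request-1", 0xB: "request-2"}
usage = attribute_cycles(cswitches, owners)
print(usage)
```

In a real system the thread-to-request mapping changes over time as worker threads pick up new requests, so the lookup would depend on the timestamp as well as the thread id.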
Example: protocol processing in a DPC
[Figure: time axis with the events cswitch, DPC start, pkt recv, DPC end, cswitch; the cycle count between the two context switches is split between Request 1 and Request 2]
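The DPC case can be sketched in the same spirit. In the Python below, cycles spent between a DPC start and DPC end are charged to whichever request the packet event inside the DPC belongs to, and the remaining cycles go to the request of the interrupted thread. The event names, the tuple layout, and the idea that the packet event already names its request are all simplifying assumptions for illustration.

```python
from collections import defaultdict


def attribute_with_dpc(events, thread_request):
    """events: list of (cycle_count, name, arg) for one CPU, sorted by time.
    name is one of "cswitch" (arg = incoming thread id), "dpc_start",
    "pkt_recv" (arg = request the packet belongs to) or "dpc_end".
    """
    usage = defaultdict(int)
    thread_owner = None   # request bound to the currently running thread
    mark = None           # start of the interval not yet charged
    dpc_owner = None      # request identified inside the current DPC
    for cycles, name, arg in events:
        if name in ("cswitch", "dpc_start"):
            # Close the interval in which the thread itself was running.
            if thread_owner is not None and mark is not None:
                usage[thread_owner] += cycles - mark
            if name == "cswitch":
                thread_owner = thread_request.get(arg)
            else:
                dpc_owner = None
            mark = cycles
        elif name == "pkt_recv":
            dpc_owner = arg
        elif name == "dpc_end":
            # The whole DPC is charged to the request named by the packet event.
            if dpc_owner is not None and mark is not None:
                usage[dpc_owner] += cycles - mark
            mark = cycles
    return dict(usage)


# Illustrative log: a DPC for request-2 interrupts a thread working on request-1.
events = [
    (1000, "cswitch", 0xA),
    (1200, "dpc_start", None),
    (1250, "pkt_recv", "request-2"),
    (1300, "dpc_end", None),
    (1500, "cswitch", 0xB),
]
usage = attribute_with_dpc(events, {0xA: "request-1"})
print(usage)
```

Without the event inside the DPC, all 500 cycles between the two context switches would have been charged to request-1, which is the error this kind of attribution is meant to avoid.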
Application and middleware events
Cover points where flow of control moves between components
Cover points where resources are multiplexed and demultiplexed
E.g. user-level scheduling primitives
Propagation of a global request id is not required!
Magpie used to do this, but no longer does
Instrumenting a web service
[Figure: the Web Server and SQL Server component diagram annotated with instrumentation points. Labels: ISAPI Filter, HTTPModule, CLR profiler, Wrappers, Extended SPs, Intercept WinSock2 API, Packet capture, Event Tracing for Windows; the last three appear once per machine]
Generic request extraction
No inbuilt assumptions about the system or the application
Schema specifies semantics of events
No common unique identifier
Easy to add new event types
Parser stitches events into requests based on event semantics
Terminology
Namespace: event parameter which references an entity in the system, e.g. thread id
Timeline: instantiation of a namespace with a unique value, e.g. thread id = 0xa
Events bind or unbind requests to timelines
Bindings capture the semantics of each event for a particular request type
Example: connecting events
[Figure: the events cswitch, DPC start, TCP pkt, DPC end, cswitch, Enter Recv and Recv returns placed on the timelines Cpuid=0, Tid=0xa, Tid=0xb and Connid=0xd, and joined into Request 1 and Request 2]
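The binding idea can be illustrated with a small Python sketch. A timeline is a (namespace, value) pair, and a schema says, for each event type, which timelines the event joins together and which it releases afterwards. The schema contents, event names and parameter names below are hypothetical; they are not Magpie's actual schema format.

```python
from collections import defaultdict

# Hypothetical schema: for each event type, the namespaces whose timelines
# the event binds to one request, and the namespaces it unbinds afterwards.
SCHEMA = {
    "tcp_pkt":      {"bind": ["connid"],        "unbind": []},
    "recv_returns": {"bind": ["connid", "tid"], "unbind": ["connid"]},
    "send":         {"bind": ["tid"],           "unbind": ["tid"]},
}


def extract_requests(events):
    """events: list of (event_name, params) in timestamp order."""
    bound = {}                    # timeline (namespace, value) -> request id
    requests = defaultdict(list)  # request id -> event names, in order
    next_id = 1
    for name, params in events:
        rule = SCHEMA.get(name)
        if rule is None:
            continue              # event type not described by the schema
        timelines = [(ns, params[ns]) for ns in rule["bind"]]
        # Reuse the request already bound to one of these timelines, if any.
        rid = next((bound[t] for t in timelines if t in bound), None)
        if rid is None:
            rid, next_id = next_id, next_id + 1
        for t in timelines:
            bound[t] = rid
        requests[rid].append(name)
        for ns in rule["unbind"]:
            bound.pop((ns, params[ns]), None)
    return dict(requests)


# Two interleaved requests, told apart without any global request id.
events = [
    ("tcp_pkt",      {"connid": 0xD}),
    ("tcp_pkt",      {"connid": 0xE}),
    ("recv_returns", {"connid": 0xD, "tid": 0xB}),
    ("recv_returns", {"connid": 0xE, "tid": 0xA}),
    ("send",         {"tid": 0xB}),
    ("send",         {"tid": 0xA}),
]
extracted = extract_requests(events)
print(extracted)
```

The parser itself knows nothing about TCP or threads; all of that knowledge sits in the schema, which is what makes adding a new event type a matter of adding one entry.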
End-to-end request extraction
An instance of the request parser runs on each machine in the distributed system
Online or offline mode
Offline post-processing connects request fragments from each node according to a globally unique namespace, e.g. packet IP identifier
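The offline joining step might look like the following Python sketch, which merges per-machine request fragments whenever they saw the same packet identifier. The fragment names and the plain integer packet identifiers are assumptions for illustration, and the union-find approach is one possible way to do the merge rather than the method used by Magpie.

```python
def stitch(fragments):
    """Group request fragments that share at least one packet identifier.

    fragments: dict fragment_name -> set of packet identifiers seen by it.
    Returns a sorted list of sorted groups of fragment names.
    """
    parent = {f: f for f in fragments}

    def find(f):
        # Union-find lookup with path halving.
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    first_seen = {}  # packet identifier -> first fragment that saw it
    for frag, packets in fragments.items():
        for pkt in packets:
            if pkt in first_seen:
                parent[find(frag)] = find(first_seen[pkt])
            else:
                first_seen[pkt] = frag

    groups = {}
    for frag in fragments:
        groups.setdefault(find(frag), set()).add(frag)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical fragments from a web server and a database server.
fragments = {
    "web/1": {101, 102},
    "sql/7": {102, 103},
    "web/2": {201},
    "sql/8": {201, 202},
}
stitched = stitch(fragments)
print(stitched)
```

One caveat: the IPv4 identification field is only 16 bits wide and wraps around, so a practical matcher would probably also key on addresses and restrict matches to a time window.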
Clustering for workload generation
Target the Indy performance modelling tool
Previously: microbenchmark approach
Calculates throughput, bottlenecks
Needs transaction mix, resource consumption
Run 10000 of each “transaction type” (URL)
Divide aggregate resource usage by 10000
Aim: provide realistic workload models
From real, mixed workloads
Derive transaction “types” automatically
Single request: cartoon view
Partial ordering of events
Annotated with resource usage
[Figure: one request drawn as a graph across the rows IIS CPU, ASP.NET CPU, SQL Server CPU, Network and Disk, labelled with times of 1ms to 6ms, transfer sizes of 1k to 12k, and disk reads of 24k and 192k]
Behavioural clustering of requests
Represent requests as event strings
Use Levenshtein string edit distance
Modified to factor in resource usage vectors
Cluster requests based on this distance
“Flatten” out any concurrency
Linear-time algorithm
Each cluster is a request “type”
Select representative from near centroid
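One way the steps above could fit together is sketched below in Python. A request is flattened to a list of (event symbol, resource vector) pairs; the edit distance charges 1 for an insertion, deletion or symbol mismatch, and a value between 0 and 1 for matching symbols whose resource vectors differ. The cost function, the single-pass leader clustering and the threshold are all assumptions chosen for brevity; the slides do not specify Magpie's actual modification or clustering algorithm.

```python
def resource_gap(a, b):
    """Relative difference between two resource vectors, in [0, 1]."""
    diff = sum(abs(x - y) for x, y in zip(a, b))
    scale = sum(abs(x) + abs(y) for x, y in zip(a, b))
    return diff / scale if scale else 0.0


def distance(r1, r2):
    """Levenshtein distance over (symbol, resources) pairs, where
    substituting equal symbols costs their resource gap."""
    prev = list(range(len(r2) + 1))
    for i, (s1, v1) in enumerate(r1, start=1):
        cur = [i]
        for j, (s2, v2) in enumerate(r2, start=1):
            sub = resource_gap(v1, v2) if s1 == s2 else 1.0
            cur.append(min(prev[j] + 1,         # delete from r1
                           cur[j - 1] + 1,      # insert into r1
                           prev[j - 1] + sub))  # substitute or match
        prev = cur
    return prev[-1]


def cluster(requests, threshold):
    """Single pass: join the first cluster whose leader is close enough."""
    clusters = []
    for r in requests:
        for c in clusters:
            if distance(r, c[0]) <= threshold:
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters


def representative(members):
    """The member with the smallest total distance to the others."""
    return min(members, key=lambda r: sum(distance(r, o) for o in members))


# Illustrative requests; resource vectors hold a single CPU figure in ms.
a = [("IIS", (2,)), ("ASP", (6,)), ("SQL", (3,))]
b = [("IIS", (2,)), ("ASP", (5,)), ("SQL", (3,))]
c = [("IIS", (1,)), ("DISK", (24,))]
clusters = cluster([a, b, c], threshold=0.5)
print(len(clusters), distance(a, b))
```

Here a and b differ only slightly in ASP.NET CPU time and fall into one cluster, while c has a different event string and forms its own.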
Build a workload model by clustering similar requests
Requests in the same cluster often have different URLs, and one URL may appear in many clusters
[Figure: five clusters labelled A to E, each shown as a representative request graph annotated with times and transfer sizes; the percentages shown are 7%, 10%, 15%, 5% and 63%]
Taking it further: work in progress
Online and incremental modelling:
More sophisticated models
Detect component failure
Detect sudden shifts in workload
Learn the probabilistic state machine for each request
cf. flowcharts annotated with performance information
“Bayesian watchdogs”
Compute the likelihood of a request’s behaviour as it moves through the system
Deal with “unlikely” requests appropriately
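As a rough illustration of the watchdog idea, the Python sketch below learns a first-order transition model from observed event sequences and then tracks the log-likelihood of a new request event by event. A first-order Markov chain and a fixed floor probability for unseen transitions are assumptions made to keep the example short; the talk describes this as work in progress and does not specify the model.

```python
import math
from collections import defaultdict

START, END = "<start>", "<end>"


def learn(sequences):
    """Count transitions between consecutive events in each sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        path = [START, *seq, END]
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return counts


def transition_prob(counts, a, b, floor=1e-6):
    """Estimated P(b | a); unseen transitions get a small floor value."""
    if a not in counts:
        return floor
    seen = counts[a].get(b, 0)
    return seen / sum(counts[a].values()) if seen else floor


def running_log_likelihood(counts, seq):
    """Yield (event, cumulative log-likelihood) as the request progresses."""
    prev, total = START, 0.0
    for event in seq:
        total += math.log(transition_prob(counts, prev, event))
        yield event, total
        prev = event


# Illustrative training data: three dynamic requests and one static one.
model = learn([["IIS", "ASP", "SQL"]] * 3 + [["IIS", "STATIC"]])
normal = list(running_log_likelihood(model, ["IIS", "ASP", "SQL"]))
odd = list(running_log_likelihood(model, ["IIS", "SQL"]))
print(normal[-1][1], odd[-1][1])
```

A request whose running likelihood drops sharply, as the second one does when IIS is followed directly by SQL, is a candidate for being treated as “unlikely”.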
Current status
Recent focus has been on developing a generic request extraction scheme
Prototype for 2-machine e-commerce site
Prototype for single machine SQL Server 2000
TPC-W style workload
Challenge is user mode scheduler
TPC-C workload
Other applications on the way
Large-scale
“Real” systems with “real” performance problems
Conclusion
Magpie is a tool for performance analysis in a distributed system
Bottom up, per-request approach
Complementary to existing techniques:
Performance counters
Program profiling
Feeds into performance debugging and prediction tools