Magpie: Distributed request tracking for realistic performance modelling Rebecca Isaacs Paul Barham Richard Mortier Dushyanth Narayanan Microsoft Research Cambridge James Bulpin University of Cambridge 12 November 2003

Magpie: Distributed request
tracking for realistic
performance modelling
Rebecca Isaacs
Paul Barham
Richard Mortier
Dushyanth Narayanan
Microsoft Research Cambridge
James Bulpin
University of Cambridge
12 November 2003
Performance in distributed systems

Faults in distributed systems are notoriously hard to diagnose
Performance problems are even more subtle to debug
  Often transient or affect only a subset of requests / users
  Frequently involve complex interactions between multiple machines
  Aggregate statistics (e.g. utilization) may look perfectly normal

Magpie Approach

Track individual requests end to end
  Observe control flow (causality)
  Monitor resource consumption: CPU, bandwidth, disk
  -> Debug performance “in the small”
Build a probabilistic workload model from the aggregate requests
  Cluster similar requests according to their observed behaviour
  -> Debug performance “in the large”

How do we use this information?

Performance debugging
  Why did this request take much longer than that request?
Fault detection
Configuration and management
Performance prediction
  Realistic workload models for capacity planning
  Obtain automatically on a “live” system

Magpie components

Instrumentation
  System activity recorded to logs
Generic request parser
  Extract individual requests from logs according to an event schema
Model construction
  Behavioural clusters
  Probabilistic state machine

Outline

Introduction
What is a request?
Instrumentation
Request extraction
Modelling
Current status

What is a request?

System activity which takes place in response to an action initiated by the application being traced
  HTTP request
  Database query
  File open request
We describe a request as
  The sequence of application components involved in its processing
  The resources consumed at each stage
  -> CPU, bandwidth, disk transfer size, (latency)

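The description above (a sequence of components, each with the resources it consumed) maps naturally onto a small record type. The following is only an illustrative sketch: the class names `Stage` and `Request`, the field names and the sample numbers are assumptions made for this example and are not taken from Magpie itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Stage:
    """One step of a request: the component that ran and what it consumed."""
    component: str          # hypothetical labels such as "IIS" or "SQL Server"
    cpu_ms: float = 0.0
    net_bytes: int = 0
    disk_bytes: int = 0


@dataclass
class Request:
    """A request as an ordered list of stages (names are illustrative)."""
    kind: str
    stages: List[Stage] = field(default_factory=list)

    def components(self) -> List[str]:
        # The sequence of application components involved in processing.
        return [s.component for s in self.stages]

    def totals(self) -> Dict[str, float]:
        # Resources summed over every stage of the request.
        return {
            "cpu_ms": sum(s.cpu_ms for s in self.stages),
            "net_bytes": sum(s.net_bytes for s in self.stages),
            "disk_bytes": sum(s.disk_bytes for s in self.stages),
        }


# Made-up sample values, purely for illustration.
example = Request("HTTP request", [
    Stage("IIS", cpu_ms=2.0, net_bytes=1024),
    Stage("ASP.NET", cpu_ms=5.0),
    Stage("SQL Server", cpu_ms=3.0, disk_bytes=24 * 1024),
])
```

Keeping per-stage figures rather than a single total is what allows two requests to be compared stage by stage later on.
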
A typical e-commerce site (1)

[Figure: tiers of a typical e-commerce site: Internet, Web Front Ends, SQL Servers, Storage]

A typical e-commerce site (2)

[Figure: software components on each machine]
  Web Server: http.sys, IIS, Filter, Static Content, ASP.NET, CLR, Application Logic, ADO.NET, WinSock2 API, Kernel
  SQL Server: Stored procedures, Data, WinSock2 API, Kernel

HTTP request: detailed view

[Figure: timeline of a single HTTP request between 10.051s and 10.155s, drawn across the threads WEB.eec, WEB.398 and SQL.9c4 together with Disk, Net RX and Net TX rows. Key: Blocked, IIS, ASP.NET, SQL, Disk, Other]
Callouts in the figure:
  HTTP request packet
  IIS worker thread picks up request from http.sys
  ASP.NET worker thread takes over
  Sync WinSock send to SQL Server
  ASP.NET thread blocks after RPC to database
  TDS request and reply packets sent and received
  SQL thread unblocks
  HTTP response packets sent back to client
  IIS worker thread wakes up to write log

Why is request tracking hard?

Many components, multiple machines
  Must track control flow across machines
No globally unique request ID
  Components are developed independently
Many threads participate in processing a request
  Multiple thread pools
Asynchronous communication
  Must match send/recvs between threads/machines
Hand-rolled synchronization primitives
  SQL server has user-mode scheduler

Event Tracing for Windows

Low-overhead event mechanism
  Events timestamped with cycle counter
  Global ordering on events on a single machine
  Can enable/disable sets of events at runtime
Using ETW in Magpie
  Each instrumentation point posts an event
  Events are logged to disk
  Logs are post-processed to extract requests
  Can also consume events in real time

Instrumentation points

Existing ETW event providers
  IIS, kernel
App-specific hooks
  IIS, ASP.NET, SQL Server
Detours
  Wrap dlls to trap Win32 and WinSock2 calls
WinPcap
  Capture packets on the wire

CPU usage from kernel events

The ETW kernel logger records every context switch
  How do we know which cycles are used for which request?
We can attribute cycles to a request by
  An application-specific event which occurs within a delimited sector of CPU time, or
  The current context of execution, e.g. thread id

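The second attribution rule (charge cycles to whichever request owns the current thread) can be sketched as a single pass over the context-switch records. This is a simplified illustration, not Magpie's implementation: the tuple layout `(cycle, cpu, new_tid)`, the function name `attribute_cycles` and the fixed thread-to-request map are all assumptions; in practice a thread's binding to a request changes over time.

```python
from collections import defaultdict


def attribute_cycles(cswitches, thread_to_request, end_cycle):
    """Charge the cycles between context switches to the running thread's request.

    cswitches: list of (cycle, cpu, new_tid), sorted by cycle (assumed layout).
    thread_to_request: dict mapping a thread id to a request id (assumed static).
    end_cycle: cycle count at which the trace stops.
    """
    usage = defaultdict(int)
    running = {}  # cpu -> (tid, cycle at which it was switched in)
    for cycle, cpu, new_tid in cswitches:
        if cpu in running:
            tid, start = running[cpu]
            request = thread_to_request.get(tid)
            if request is not None:
                usage[request] += cycle - start
        running[cpu] = (new_tid, cycle)
    # Close the interval that is still open on each CPU when the trace ends.
    for tid, start in running.values():
        request = thread_to_request.get(tid)
        if request is not None:
            usage[request] += end_cycle - start
    return dict(usage)


# Made-up trace: thread 0xa runs, is preempted by 0xb, then runs again.
switches = [(100, 0, 0xA), (250, 0, 0xB), (400, 0, 0xA)]
owners = {0xA: "request 1", 0xB: "request 2"}
cycles = attribute_cycles(switches, owners, end_cycle=450)
```

Threads with no entry in the map are simply not charged to any request, which is how idle or unrelated activity falls out of the accounting in this sketch.
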
Example: protocol processing in a DPC

[Figure: timeline of kernel events (cswitch, DPC start, pkt recv, DPC end, cswitch) against time, with the cycle counts attributed to Request 1 and Request 2]

Application and middleware events

Cover points where flow of control moves between components
Cover points where resources are multiplexed and demultiplexed
  E.g. user-level scheduling primitives
Propagation of a global request id is not required!
  Magpie used to do this but not any more

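Instrumenting a web service

[Figure: the component diagram of the e-commerce site, overlaid with instrumentation points]
  Web Server components: http.sys, IIS, Filter, Static Content, ASP.NET, CLR, Application Logic, ADO.NET, WinSock2 API, Kernel
  Web Server instrumentation: Wrappers, ISAPI Filter, HTTPModule, CLR profiler, Intercept (WinSock2 API), Event Tracing for Windows, Packet capture
  SQL Server components: Stored procedures, Data, WinSock2 API, Kernel
  SQL Server instrumentation: Extended SPs, Intercept (WinSock2 API), Event Tracing for Windows, Packet capture
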
Generic request extraction

No inbuilt assumptions about the system or the application
  No common unique identifier
Schema specifies semantics of events
  Easy to add new event types
Parser stitches events into requests based on event semantics

Terminology

Namespace
  Event parameter which references an entity in the system, e.g. thread id
Timeline
  Instantiation of a namespace with a unique value, e.g. thread id = 0xa
Events bind or unbind requests to timelines
  Bindings capture the semantics of each event for a particular request type

Example: connecting events

[Figure: events (cswitch, DPC start, TCP pkt, DPC end, Enter Recv, Recv returns) placed on the timelines Cpuid=0, Tid=0xa, Tid=0xb and Connid=0xd, and joined into Request 1 and Request 2]

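The bind/unbind idea can be sketched as a table-driven parser: a schema says which namespaces each event type mentions and which timelines it binds or unbinds, and the parser follows those bindings instead of a request id. Everything named here is hypothetical (the `SCHEMA` layout, the event type names and the function `extract_requests`); the real Magpie schema format is not shown in these slides.

```python
# Hypothetical schema: for each event type, the namespaces it carries, whether
# it starts a request, and which timelines it binds or unbinds afterwards.
SCHEMA = {
    "request_start": {"params": ["tid"], "starts": True, "bind": ["tid"]},
    "send": {"params": ["tid", "connid"], "bind": ["connid"]},
    "recv": {"params": ["connid", "tid"], "bind": ["tid"], "unbind": ["connid"]},
    "request_end": {"params": ["tid"], "unbind": ["tid"]},
}


def extract_requests(events, schema):
    """Group events into requests by following timeline bindings."""
    bound = {}      # (namespace, value) timeline -> request number
    requests = {}   # request number -> list of event types, in log order
    next_id = 1
    for event in events:
        rule = schema[event["type"]]
        timelines = [(ns, event[ns]) for ns in rule["params"]]
        if rule.get("starts"):
            request = next_id
            next_id += 1
        else:
            # The event joins whichever request one of its timelines is bound to.
            request = next((bound[t] for t in timelines if t in bound), None)
        if request is None:
            continue  # not attributable to any request in this sketch
        requests.setdefault(request, []).append(event["type"])
        for ns in rule.get("bind", []):
            bound[(ns, event[ns])] = request
        for ns in rule.get("unbind", []):
            bound.pop((ns, event[ns]), None)
    return requests


# Made-up log: two interleaved requests; request 1 hops from thread 0xa to
# thread 0xc through connection 0xd.
log = [
    {"type": "request_start", "tid": 0xA},
    {"type": "request_start", "tid": 0xB},
    {"type": "send", "tid": 0xA, "connid": 0xD},
    {"type": "recv", "connid": 0xD, "tid": 0xC},
    {"type": "request_end", "tid": 0xB},
    {"type": "request_end", "tid": 0xC},
]
stitched = extract_requests(log, SCHEMA)
```

Adding a new event type in this sketch means adding one schema row, which is the property the "easy to add new event types" bullet is after.
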
End-to-end request extraction

An instance of the request parser runs on each machine in the distributed system
  Online or offline mode
Offline post-processing connects request fragments from each node according to a globally unique namespace, e.g. packet IP identifier

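The offline join step can be sketched as grouping per-machine fragments that share a value in the globally unique namespace. The sketch below uses a small union-find for the grouping; the fragment layout (`name`, `packet_ids`) and the function name `join_fragments` are assumptions made for illustration. It also assumes identifiers are unique within the trace, which needs care with real IPv4 identifiers because the field is only 16 bits wide and wraps.

```python
def join_fragments(fragments):
    """Group request fragments that share at least one packet identifier.

    fragments: list of dicts with a 'name' and a set of 'packet_ids'
    (an assumed layout). Returns sorted groups of fragment names.
    """
    parent = list(range(len(fragments)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # packet id -> index of the first fragment that saw it
    for i, fragment in enumerate(fragments):
        for packet_id in fragment["packet_ids"]:
            if packet_id in first_seen:
                parent[find(i)] = find(first_seen[packet_id])
            else:
                first_seen[packet_id] = i

    groups = {}
    for i, fragment in enumerate(fragments):
        groups.setdefault(find(i), []).append(fragment["name"])
    return sorted(sorted(names) for names in groups.values())


# Made-up fragments from a web server and a database server.
fragments = [
    {"name": "web-1", "packet_ids": {101, 202}},
    {"name": "web-2", "packet_ids": {103}},
    {"name": "sql-1", "packet_ids": {101, 202}},
    {"name": "sql-2", "packet_ids": {103}},
]
joined = join_fragments(fragments)
```
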
Clustering for workload generation

Target the Indy performance modelling tool
  Calculates throughput, bottlenecks
  Needs transaction mix, resource consumption
Previously: microbenchmark approach
  Run 10000 of each “transaction type” (URL)
  Divide aggregate resource usage by 10000
Aim: provide realistic workload models
  From real, mixed workloads
  Derive transaction “types” automatically

Single request: cartoon view

Partial ordering of events
Annotated with resource usage

[Figure: one request drawn as a graph over IIS CPU, ASP.NET CPU, SQL Server CPU, Network and Disk rows; annotations include CPU times (1ms to 6ms), transfer sizes (1k to 12k) and disk reads (24k, 192k)]

Behavioural clustering of requests

Represent requests as event strings
  “Flatten” out any concurrency
Use Levenshtein string edit distance
  Modified to factor in resource usage vectors
Cluster requests based on this distance
  Linear-time algorithm
Each cluster is a request “type”
  Select representative from near centroid

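One way to read "Levenshtein distance modified to factor in resource usage" is to make the substitution cost depend on how different the resource vectors are when the event symbols match. The slides do not give the actual cost function or name the clustering algorithm, so both choices below are assumptions: a normalised L1 difference for the cost, and a simple single-pass "leader" clustering standing in for the linear-time algorithm.

```python
def resource_cost(u, v):
    """Normalised L1 difference between two resource vectors, in [0, 1]."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(abs(a) + abs(b) for a, b in zip(u, v))
    return num / den if den else 0.0


def distance(r1, r2):
    """Edit distance over lists of (event symbol, resource vector) pairs.

    Insertions and deletions cost 1; substituting the same symbol costs the
    resource difference, and substituting a different symbol costs 1.
    """
    prev = [float(j) for j in range(len(r2) + 1)]
    for i, (e1, u) in enumerate(r1, start=1):
        cur = [float(i)]
        for j, (e2, v) in enumerate(r2, start=1):
            sub = resource_cost(u, v) if e1 == e2 else 1.0
            cur.append(min(prev[j] + 1.0, cur[j - 1] + 1.0, prev[j - 1] + sub))
        prev = cur
    return prev[-1]


def cluster(requests, threshold):
    """Single-pass leader clustering (an assumed stand-in algorithm)."""
    leaders, clusters = [], []
    for request in requests:
        for k, leader in enumerate(leaders):
            if distance(request, leader) <= threshold:
                clusters[k].append(request)
                break
        else:
            leaders.append(request)
            clusters.append([request])
    return clusters


# Made-up requests: each element is (component, (cpu_ms,)).
a = [("IIS", (2,)), ("ASP.NET", (5,)), ("SQL", (3,))]
b = [("IIS", (2,)), ("ASP.NET", (6,)), ("SQL", (3,))]
c = [("IIS", (1,))]
groups = cluster([a, b, c], threshold=0.5)
```

Here `a` and `b` follow the same path with slightly different CPU use and land in one cluster, while the much shorter `c` starts its own. The first member of each cluster acts as its representative in this sketch, which is cruder than picking one near the centroid.
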
Build a workload model by clustering similar requests

Requests in the same cluster often have different URLs, and one URL may appear in many clusters

[Figure: five request clusters labelled A to E, each shown as a representative request annotated with CPU times and transfer sizes; the clusters are marked 7%, 10%, 15%, 5% and 63%]

Taking it further: work in progress

Online and incremental modelling:
  Detect component failure
  Detect sudden shifts in workload
More sophisticated models
  Learn the probabilistic state machine for each request
  cf. flowcharts annotated with performance information
“Bayesian watchdogs”
  Compute the likelihood of a request’s behaviour as it moves through the system
  Deal with “unlikely” requests appropriately

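As a rough illustration of the watchdog idea, a probabilistic state machine can be approximated by a first-order Markov chain over component names: learn transition probabilities from observed requests, then score a new request by the log-likelihood of its path. The Markov assumption, the function names and the `floor` probability for unseen transitions are all choices made for this sketch; the slides describe this as work in progress and do not specify a model.

```python
import math
from collections import defaultdict


def learn_transitions(sequences):
    """Estimate transition probabilities from observed component sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        path = ["START"] + list(seq) + ["END"]
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {
        a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
        for a, nxt in counts.items()
    }


def log_likelihood(model, seq, floor=1e-6):
    """Log-likelihood of one request's path; unseen transitions get `floor`."""
    path = ["START"] + list(seq) + ["END"]
    return sum(
        math.log(model.get(a, {}).get(b, floor))
        for a, b in zip(path, path[1:])
    )


# Made-up training data: three dynamic requests and one static one.
training = [["IIS", "ASP.NET", "SQL"]] * 3 + [["IIS", "Static"]]
model = learn_transitions(training)
common = log_likelihood(model, ["IIS", "ASP.NET", "SQL"])
unusual = log_likelihood(model, ["IIS", "SQL"])
```

Because the score is a sum over transitions, it could be accumulated one event at a time while a request is still in flight, and a request whose running score falls below a chosen threshold could be flagged as "unlikely".
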
Current status

Recent focus has been developing a generic request extraction scheme
Prototype for 2-machine e-commerce site
  TPC-W style workload
Prototype for single machine SQL Server 2000
  Challenge is user mode scheduler
  TPC-C workload
Other applications on the way
  Large-scale
  “Real” systems with “real” performance problems

Conclusion

Magpie is a tool for performance analysis in a distributed system
Bottom up, per-request approach
Complementary to existing techniques:
  Performance counters
  Program profiling
Feeds into performance debugging and prediction tools