Magpie: Distributed request tracking for realistic performance modelling

Rebecca Isaacs, Paul Barham, Richard Mortier, Dushyanth Narayanan (Microsoft Research Cambridge)
James Bulpin (University of Cambridge)
12 November 2003

Performance in distributed systems
- Faults in distributed systems are notoriously hard to diagnose
- Performance problems are even more subtle to debug
  - Often transient or affect only a subset of requests / users
  - Frequently involve complex interactions between multiple machines
  - Aggregate statistics (e.g. utilization) may look perfectly normal

Magpie Approach
- Track individual requests end to end
  - Observe control flow (causality)
  - Monitor resource consumption: CPU, bandwidth, disk
  - Debug performance “in the small”
- Build a probabilistic workload model from the aggregate requests
  - Cluster similar requests according to their observed behaviour
  - Debug performance “in the large”

How do we use this information?
- Performance debugging
  - Why did this request take much longer than that request?
  - Fault detection
  - Configuration and management
- Performance prediction
  - Realistic workload models for capacity planning
  - Obtain automatically on a “live” system

Magpie components
- Instrumentation: system activity recorded to logs
- Generic request parser: extract individual requests from logs according to an event schema
- Model construction: behavioural clusters, probabilistic state machine

Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status

What is a request?
- System activity which takes place in response to an action initiated by the application being traced
  - HTTP request
  - Database query
  - File open request
- We describe a request as
  - The sequence of application components involved in its processing
  - The resource consumed at each stage: CPU, bandwidth, disk transfer size, (latency)

A typical e-commerce site (1)
[Figure: Internet, Web Front Ends, SQL Servers, Storage]

A typical e-commerce site (2)
[Figure: a Web Server stack (http.sys, IIS, Filter, Static Content, ASP.NET, CLR, Application Logic, ADO.NET, WinSock2 API, Kernel) and a SQL Server stack (Stored procedures, Data, WinSock2 API, Kernel)]

HTTP request: detailed view
[Figure: per-thread timeline of one HTTP request across threads WEB.eec, WEB.398 and SQL.9c4, with Disk, Net RX and Net TX rows and time marks at 10.051s, 10.100s and 10.155s. Key: Blocked, IIS, ASP.NET, SQL, Disk, Other. Annotations:]
- ASP.NET thread blocks after RPC to database
- Sync WinSock send to SQL Server
- IIS worker thread wakes up to write log
- IIS worker thread picks up request from http.sys
- ASP.NET worker thread takes over
- HTTP response packets sent back to client
- TDS request and reply packets sent and received
- SQL thread unblocks

Why is request tracking hard?
- Many components, multiple machines
- No globally unique request ID
- Many threads participate in processing a request
- Asynchronous communication
- Components are developed independently
- Multiple thread pools
- Must track control flow across machines
- Must match send/recvs between threads/machines
- Hand-rolled synchronization primitives
- SQL Server has user-mode scheduler

Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status

Event Tracing for Windows
- Low-overhead event mechanism
- Events timestamped with cycle counter
- Global ordering on events on a single machine
- Can enable/disable sets of events at runtime
- Using ETW in Magpie
  - Each instrumentation point posts an event
  - Events are logged to disk
  - Logs are post-processed to extract requests
  - Can also consume events in real time

Instrumentation points
- Existing ETW event providers: IIS, kernel
- App-specific hooks: IIS, ASP.NET, SQL Server
- Detours: wrap dlls to trap Win32 and WinSock2 calls
- WinPcap: capture packets on the wire

CPU usage from kernel events
- The ETW kernel logger records every context switch
- How do we know which cycles are used for which request?
- We can attribute cycles to a request by
  - An application-specific event which occurs within a delimited sector of CPU time, or
  - The current context of execution, e.g. thread id

Example: protocol processing in a DPC
[Figure: timeline of the events cswitch, DPC start, pkt recv, DPC end, cswitch, with the cycle counts in each interval attributed to Request 1 or Request 2]

Application and middleware events
- Cover points where flow of control moves between components
- Cover points where resources are multiplexed and demultiplexed
  - E.g. user-level scheduling primitives
- Propagation of a global request id is not required!
  - Magpie used to do this but not any more

Instrumenting a web service
[Figure: the Web Server and SQL Server stacks from "A typical e-commerce site (2)", annotated with instrumentation: Wrappers, ISAPI Filter, HTTPModule, CLR profiler, Extended SPs, Intercept WinSock2 API, and in each kernel Event Tracing for Windows and Packet capture]

Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status

Generic request extraction
- No inbuilt assumptions about the system or the application
- Schema specifies semantics of events
- No common unique identifier
- Easy to add new event types
- Parser stitches events into requests based on event semantics

Terminology
- Namespace: event parameter which references an entity in the system, e.g. thread id
- Timeline: instantiation of a namespace with a unique value, e.g. thread id = 0xa
- Events bind or unbind requests to timelines
- Bindings capture the semantics of each event for a particular request type

Example: connecting events
[Figure: timelines Cpuid=0, Tid=0xa, Tid=0xb and Connid=0xd connected by the events cswitch, DPC start, TCP pkt, DPC end, cswitch, Enter Recv and Recv returns, which separate the activity of Request 1 from Request 2]

End-to-end request extraction
- An instance of the request parser runs on each machine in the distributed system
  - Online or offline mode
- Offline post-processing connects request fragments from each node according to a globally unique namespace, e.g. packet IP identifier

Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status

Clustering for workload generation
- Target the Indy performance modelling tool
  - Calculates throughput, bottlenecks
  - Needs transaction mix, resource consumption
- Previously: microbenchmark approach
  - Run 10000 of each “transaction type” (URL)
  - Divide aggregate resource usage by 10000
- Aim: provide realistic workload models
  - From real, mixed workloads
  - Derive transaction “types” automatically

Single request: cartoon view
- Partial ordering of events
- Annotated with resource usage
[Figure: one request drawn as a graph across IIS CPU, ASP.NET CPU, SQL Server CPU, Network and Disk, each stage annotated with its CPU time, bytes transferred or disk read size, e.g. 192k read and 24k read]

Behavioural clustering of requests
- Represent requests as event strings
- Use Levenshtein string edit distance
  - Modified to factor in resource usage vectors
- Cluster requests based on this distance
- “Flatten” out any concurrency
- Linear-time algorithm
- Each cluster is a request “type”
- Select representative from near centroid

Build a workload model by clustering similar requests
- Requests in the same cluster often have different URLs, and one URL may appear in many clusters
[Figure: representative requests for five clusters with their share of the workload: A 7%, B 10%, C 15%, D 5%, E 63%; each stage annotated with times and byte counts]

Taking it further: work-in-progress
- Online and incremental modelling:
  - Detect component failure
  - Detect sudden shifts in workload
- More sophisticated models
  - Learn the probabilistic state machine for each request
  - c.f. flowcharts annotated with performance information
- “Bayesian watchdogs”
  - Compute the likelihood of a request’s behaviour as it moves through the system
  - Deal with “unlikely” requests appropriately

Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status

Current status
- Recent focus has been developing a generic request extraction scheme
- Prototype for 2-machine e-commerce site
  - TPC-W style workload
- Prototype for single machine SQL Server 2000
  - Challenge is user mode scheduler
  - TPC-C workload
- Other applications on the way
  - Large-scale
  - “Real” systems with “real” performance problems

Conclusion
- Magpie is a tool for performance analysis in a distributed system
- Bottom up, per-request approach
- Complementary to existing techniques:
  - Performance counters
  - Program profiling
- Feeds into performance debugging and prediction tools
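
As an illustration of the behavioural clustering step from the Modelling section, the Python sketch below shows one way an edit distance over event strings could factor in resource usage. It is not Magpie's implementation. The slides say only that Levenshtein distance is modified to use resource usage vectors and that clustering is linear-time, so the cost function `resource_gap`, the single-pass `cluster` routine, the threshold and the event names are all assumptions made for this example.

```python
from typing import List, Tuple

# One step of a flattened request: (event name, resource usage vector).
# Event names and vector layout are hypothetical.
Step = Tuple[str, Tuple[float, ...]]
Request = List[Step]


def resource_gap(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    """Assumed cost in [0, 1]: normalised L1 gap between non-negative usage vectors."""
    total = sum(abs(x) + abs(y) for x, y in zip(a, b))
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def request_distance(r1: Request, r2: Request) -> float:
    """Levenshtein distance over events; matching events pay their resource gap."""
    prev = [float(j) for j in range(len(r2) + 1)]
    for i, (ev1, use1) in enumerate(r1, start=1):
        cur = [float(i)]
        for j, (ev2, use2) in enumerate(r2, start=1):
            # Different events cost a full edit; equal events cost only the gap.
            sub = resource_gap(use1, use2) if ev1 == ev2 else 1.0
            cur.append(min(prev[j] + 1.0,        # drop a step of r1
                           cur[j - 1] + 1.0,     # add a step of r2
                           prev[j - 1] + sub))   # match or substitute
        prev = cur
    return prev[-1]


def cluster(requests: List[Request], threshold: float):
    """Single pass: join the first cluster whose leader is within threshold."""
    clusters: List[Tuple[Request, List[Request]]] = []
    for req in requests:
        for leader, members in clusters:
            if request_distance(leader, req) <= threshold:
                members.append(req)
                break
        else:
            clusters.append((req, [req]))
    return clusters


a = [("iis", (1.0,)), ("sql", (2.0,))]
b = [("iis", (1.0,)), ("sql", (2.2,))]
c = [("iis", (1.0,)), ("disk", (5.0,)), ("net", (1.0,))]
groups = cluster([a, b, c], threshold=0.5)
print(len(groups), [len(members) for _, members in groups])
```

Here requests `a` and `b` visit the same components with slightly different resource usage and fall into one cluster, while `c` takes a different path and starts its own. Each request is compared only against cluster leaders, so the pass is linear in the number of requests when the number of clusters stays small; picking a representative near the centroid, as the slides describe, is left out of this sketch.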