Magpie: Distributed Profiling for Performance Analysis Paul Barham (joint work with Rebecca Isaacs and Richard Mortier) 11th November 2002

Download Report

Transcript Magpie: Distributed Profiling for Performance Analysis Paul Barham (joint work with Rebecca Isaacs and Richard Mortier) 11th November 2002

Magpie:
Distributed Profiling for
Performance Analysis
Paul Barham
(joint work with Rebecca Isaacs
and Richard Mortier)
11th November 2002
What is Magpie?
• Bottom-up approach to characterising the workload of
a distributed system
• Observe concurrency, communication & latency
– Focus is on responsiveness, rather than throughput
• Resource consumption is accounted to individual
service invocations
– CPU, disk and network b/w consumed by each request on
each machine in the distributed system
– “Low level accounting + higher level annotations”
– Online measurement, offline processing
• Why? Web Services = “wide area synchronous RPC”
Motivation: Indy Performance
Modelling Toolkit
• Indy is a hybrid event/throughput based modelling kernel
– Inputs: Topology, hardware, workload
– Outputs: Utilizations, response times, bottlenecks, etc.
• Scope is multi-tier server farms running .NET web sites
• Our goal is to use Magpie to derive a generative model of the
system workload suitable for input to Indy
– Acquire a workload description with less human effort / knowledge than
current approach of “incremental micro-benchmarking”
– Extract a detailed model from a live, ‘representative’ system
• Measure with a realistic mix of transaction types – caches!
• Not just a long-term average across all transactions
– Build a probabilistic model of the usage profile which includes “hidden”
transaction types, eg error conditions, session state
• Complex behaviour may not be easily observable manually
• e.g. web transaction type discriminator is not necessarily the URL
Outline of rest of talk
1. Instrumenting a .NET web site
2. Some raw data and visualizations
3. Data analysis by clustering
4. Current status and future work
Duwamish7: VS.Net Enterprise Sample Site
10,000ft view of an e-commerce site
Client
Web Server
SQL Server
A little bit more detail…
Web Server(s)
Client(s)
SQL Server(s)
CLR / Jit Compiler
• Site-specific content and code shown in purple
• Requests can be either static content, or active – execute code to
generate HTML
• Code may execute SQL queries on database.
Cache
Stored
Procs
DBMS
Business
Logic
ADO.Net
front
end
ASP.Net
Cache
w3wp
Internal Architecture of IIS6 / ASP.NET
Application
aspnet_isapi.dll
ADO.Net
ASP.Net
Winsock
CLR
– Per-“site” worker proc
– ISAPI extension hosts CLR
– ASP.Net, ADO.Net
• Threadpools
Type-specific
Handlers
DONE
RESPONSE
CGI
Per-Request
FSMs
W3_MAIN_CONTEXT
• http.sys
– aka “Universal listener”
– Kernel-mode driver
– “TransmitFile() on steroids”
– HTTP fragment cache
AUTHORIZ
– Uses single IOCompletionPort
– Demux using “OVERLAPPED”s
ISAPI
AUTHENT
• All async I/O
URLINFO
– One per-request, event driven
– Non-blocking upcalls
Static
START
• Nested state machines
LOG
– One for CLR, one for IIS
Common
Language
Runtime
ISAPI
“ECB”
HANDLE
• w3wp.exe
aspnet_isapi.dll
Main W3
Threadpool
ULATQ
HTTPAPI
Winsock
User
I/O Compn Port
http.sys
TCP/IP
Kernel
Low Level Profiling
Kernel Tracing
• Windows XP has very efficient low-level event tracing
built in to the kernel (“ETW”)
• Command-line tools for turning on or off tracing of
specific system activities
• Magpie uses ETW on each machine to capture
–
–
–
–
–
–
Every context switch
File IO
Disk IO
Network send and receive completions
Process and thread creation and deletion
Page faults
• Overhead is surprisingly small! (~8MB log for 5 mins)
Stitching together Asynchronous
I/O and Thread Synchronization
Detours
– Public tool written by Galen Hunt (MSR)
– Can be used to intercept calls to any DLL and record
their arguments etc.
• Magpie usage:
– Intercept calls to Winsock2 API
• Allows communication to be followed (WSASend, WSARecv)
– Intercept the HTTPAPI
• Follows the processing of HTTP requests by http.sys
– Intercept certain Win32 API calls
• Observe thread synchronization (e.g. SetEvent, Waits)
• Follow async I/O completions (GetQueuedCompletionStatus)
• Directly observe threadpool.
Higher Level Annotations
ISAPI filter
–
–
–
–
Extension DLLs loaded into IIS (web server) process
Sees all HTTP requests
Filter registers with IIS to receive particular event notifications
Can examine and modify both incoming and outgoing streams of
data
• Magpie ISAPI filter:
– Allocates a unique identifier to each incoming request and adds it to
the HTTP headers
– Generates additional trace events on entry and exit
• Records cycle counter, request ID & resource usage
– “Thread 274 is now working on Request CCC00017”
Following the active content
HTTPModule
–
–
–
–
Plugins for ASP.NET extensibility
Sees all active requests
Executes in the “managed” (CLR) environment
Each active request is processed by multiple
HTTPModules, e.g. session, authentication, etc.
• Magpie HTTPModule:
– Stashes request identifier in (managed) thread local state
– Records cycle counter, managed thread id, request ID &
resource usage
Accounting Database Activity
SQL Profiler
– Standard part of SQL Server 2000 distribution
– Logs selected events to table or file
• (events can be user defined)
• Magpie SQL Profiling
– “Wraps” original stored procedures using reflection
– Wrappers generate user-defined profiler events
before and after executing the original stored proc.
• Recorded by the SQL Profiler in output trace
• Data includes request ID, cycle counter + resource usage
– Uses extended stored procedure to get cycle counter
+ resource usage stats
Intercepting Outgoing RPCs
Common Language Runtime Profiler
• Two COM interfaces:
– ICorProfiler: provides notifications, e.g. function enter/leave,
module/class load, thread mappings, .Net remoting, JIT compilation
– ICorProfilerCallback: Runtime provides API which allows profiler to
examine and modify VM state.
• Magpie CLR Profiler:
– Monitors CLRgOS thread mappings
• Records thread ids, cycle counter + resource usage
– Intercepts JIT compilation of relevant ADO.NET classes/methods
• Rewrites byte-code (IL) to insert calls to our own profiling functions
• Modifies SQL stored procedure invocations to call wrapped
versions and pass an extra argument (i.e. request ID)
• Lots of fun but very hairy code! 
Putting it all together…
SQL Server
ASP.NET
ASP.NET
IIS
IIS
SQL Profiler
Wrappers
App Logic
Rewritten
IL
ADO.NET
ISAPI Filter
HTTPModule
Web Server
CLR
CLR
Stored Exten
-ded
Procs SPs
DBMS
DBMS
CLR Profiler
& IL Patcher
Intercept
HTTP API
Intercept
Intercept
WinSock2 API
WinSock2 API
HTTP
Pkt
Capture
Tap
http.sys
Kernel
PerfInfo
TDS
Pkt
Capture
Tap
Pkt
Capture
Tap
Kernel
PerfInfo
Userdefined
events
Raw Event Data – Yuk!
Logs sorted
together using
cycle counter
IIS worker
thread picks
up request
from http.sys
(async I/O
Completion)
ASP worker
thread takes
over
blocking wait
for reply from
SQL Server
incoming HTTP request packet
...
574530113665473L 157.58.60.98.3173 > 192.168.187.66.http: P 9987:10368(381) \\
ack 80470 win 64240 (DF) (ttl 127, id 27528, len 421)
574530113724014 TcpReceive svchost.exe 11c 381 192.168.187.066:80 157.058.060.098:3173
...
574530114445762 f4c + HttpReceiveHttpRequest(ReqQueueHandle=1c0, RequestId=0, \\
Flags=1, pRequestBuffer=faae48, RequestBufferLength=9d0, pBytesReceived=0,
pOverlapped=faae14) 0
574530114613968 f4c - HttpReceiveHttpRequest() -> 3e5
574530114724562 f4c + GetQueuedCompletionStatus(1b8,f5ff84,f5ff8c,f5ff98,0)
574530114836287 f4c - GetQueuedCompletionStatus(,,,,) -> 0 5a8d11fc 0
574530114947322 f4c + GetQueuedCompletionStatus(1b8,f5ff84,f5ff8c,f5ff98,1b7740)
574530115070456 CSwitch 0 w3wp.exe 658 f4c Waiting WrQueue System 4 e50 3913680
574530115394668 CSwitch 0 System 4 e50 Ready - w3wp.exe 658 f4c 324212
574530115456005 f4c - GetQueuedCompletionStatus(,,,,) -> 1 5a8d11fc fab854
574530115683276 f4c ! HttpReceiveHttpRequest Ov: fab854 CID: 40000284e2000000 \\
ReqID: 40000284e2000000 LA: 192.168.187.66:80 RA: 157.58.60.98:3173 \\
IndyProf label
Bytes: 0 Flags: 0 Verb: 4 RawUrl: /duwamish7/categories.aspx?ID=843
574530115882433 12673200697.357422 ccc00663 f4c /duwamish7/categories.aspx?ID=843
574530116116697 f4c + GetQueuedCompletionStatus(1b8,f5ff84,f5ff8c,f5ff98,0)
...
574530116959382 CSwitch 0 w3wp.exe 658 f4c Waiting WrQueue w3wp.exe 658 6dc 1564714
574530117373185 12673200697.3574 ccc00663 6dc /duwamish7/categories.aspx?ID=843
HttpModule
...
label
574530119411731 6dc + send(s=4f0, buf=2739044, len=7f, flags=0)
574530119656299L 192.168.187.66.4681 > 192.168.187.68.ms-sql-s: P 23578:23705(127) \\
ack 227230 win 16735 (DF) (ttl 128, id 10136, len 167)
send request to SQL Server
574530119721688 6dc - send() -> 7f
...
574530120169086 6dc + WSARecv(s=4f0, lpBuffers=499f4fc 16384, dwBufferCount=1, \\
lpNumberOfBytesRecvd=499f514, lpFlags=499f510, lpOverlapped=0 0, \\
lpCompletionRoutine=0)
574530120325932 CSwitch 0 w3wp.exe 658 6dc Waiting UserRequest w3wp.exe 658 ec8 3366550
...
Visualisation Tools – PowerPoint Macros!
+WSARecv
-send
Sync WinSock
send to DB
+send
WEB.f4c
WEB.6dc
Disk
TDS request
packet sent
and received
Net RX
Net TX
Net TX
Net RX
Disk
SQL:StartProf
48.48224443s
-WSARecv
HTTP
Request
packet from
client
HttpModule label
Htt pMo dBegin
+GetQ ue ue dCo mp
Ind yProfReq
+GetQ ue ue dCo mp
!Htt pReceive Ht
IndyProf label
-GetQ ue ue dCo mp
IIS worker thread
picks up request
from http.sys
SQL thread
unblocks
SQL.a34
Time/s
48.48224443s
KEY:
blocked
IIS
ASP.NET
SQL
Disk
Unaccounted
Part of the visualisation of Transaction ccc00663:/duwamish7/categories.asp?ID=843
SQL
profiler
trace
label
Disk
KEY:
Blocked
IIS
ASP.NET
10.051s
SQL
Disk
Other
SQL:EndProf
-WSARecv
SQL:StartProf
SQL:EndProf
-WSARecv
SQL:StartProf
SQL:EndProf
-WSARecv
SQL:StartProf
WEB.eec
-WSARecv
+send
-send
+WSARecv
-WSARecv
+send
-send
+WSARecv
-WSARecv
+send
-send
+WSARecv
HttpModBegin
!HttpReceiveHt
IndyProfReq
Transaction ccc000b9: /duwamish7/categories.aspx?ID=831
WEB.398
Disk
Net RX
Net TX
10.051s
10.100s
Net TX
Net RX
SQL.9c4
10.100s
Interleaving of Simultaneous Requests
Transactions between 28.8s and 31.0s
WEB.cc8
WEB.cc4
WEB.b90
WEB.ad8
WEB.4b0
Disk
Net RX
Net TX
28.84s
Net TX
Net RX
Disk
SQL.6cc
SQL.e84
28.84s
(Each colour is a different transaction, grey = blocked)
Net RX
Disk
KEY:
SQL:EndProf
-WSARecv
SQL:StartProf
ContextSwitchI
ContextSwitchI
SQL:EndProf
-WSARecv
SQL:StartProf
ContextSwitchI
ContextSwitchI
blocked
18.3357530767s
IIS
Net RX
SQL.9d8
18.4334278667s
WEB.f4c
WEB.474
Disk
Disk
Net RX
Net RX
Disk
ASP.NET
SQL
Disk
49.0598638833s
49.0598638833s
Other
Transaction ccc0067f: / d u wa mish7/ b ook.aspx?ID=58
-HttpReceiveHt
+GetQueuedComp
ContextSwitchI
-HttpSendRespo
+HttpReceiveHt
IndyProfResp
+HttpSendRespo
-GetQueuedComp
Transaction ccc000b3: / d u wamish7 /images / banner / line.gif
ContextSwitchI
+GetQueuedComp
-GetQueuedComp
+GetQueuedComp
-HttpReceiveHt
+HttpReceiveHt
-HttpSendRespo
+HttpSendRespo
-WaitForSingle
+WaitForSingle
-WaitForSingle
IndyProfResp
+WaitForSingle
-WaitForSingle
+WaitForSingle
-GetQueuedComp
+GetQueuedComp
-HttpSendHttpR
ContextSwitchI
-HttpReceiveHt
+GetQueuedComp
-GetQueuedCom p
+GetQueuedComp
-HttpSen dRespo
+HttpReceiveHt
+WaitForSingle
-WaitForSingle
+HttpSen dRespo
+WaitForSingle
-WaitForSingle
IndyPr ofResp
+WaitForSingle
-WaitForSingle
-GetQueuedCom p
ContextSwitchI
-HttpSen dRespo
+PostQueuedCom
-PostQueuedCom
+GetQueuedComp
-GetQueuedCom p
+GetQueuedComp
ContextSwitchI
+GetQueuedComp
-GetQueuedCom p
+GetQueuedComp
+HttpSen dRespo
+WaitForSingle
-WaitForSingle
ContextSwitchI
+WaitForSingle
-WaitForSingle
HttpModEn d
+WaitForSingle
-WaitForSingle
+WaitForSingle
-WaitForSingle
+WaitForSingle
-WaitForSingle
HttpModBegin
ContextSwitchI
-SetEvent
+SetEvent
+GetQueuedComp
-GetQueuedCom p
+WaitForSingle
-WaitForSingle
+WaitForSingle
-WaitForSingle
IndyPr ofReq
+WaitForSingle
-WaitForSingle
!HttpReceiveHt
-GetQueuedCom p
IndyProfResp
-HttpSendRespo
+HttpSendRespo
HttpModEnd
-WSARecv
+send
-send
+WSARecv
Transaction ccc000ba: / d u wamish7 /book.aspx?ID=37816
ContextSwitchI
-HttpSendRespo
+PostQueuedCom
-PostQueuedCom
ContextSwitchI
+HttpSendRespo
ContextSwitchI
+HttpSendHttpR
-WaitForSingle
+WaitForSingle
-WaitForSingle
IndyProfReq
+WaitForSingle
-WaitForSingle
+WaitForSingle
!HttpReceiveHt
-GetQueuedComp
SQL:EndProf
-WSARecv
SQL:StartProf
-WSARecv
+send
-send
+WSARecv
-WSARecv
+send
-send
+WSARecv
WEB.a18
HttpModEnd
22.6001556s
ContextSwitchI
-WSARecv
Net TX
+WSARecv
ContextSwitchI
+GetQueuedComp
-GetQueuedComp
+GetQueuedComp
22.6001556s
SQL:EndProf
Net RX
-send
Net RX
+send
Disk
Disk
ContextSwitchI
WEB.848
IndyProfReq
10.1559640767s
HttpModBegin
SQL.9c4
!HttpReceiveHt
-GetQueuedComp
10.0513824767s
ContextSwitchI
Transaction ccc000b9:/duwamish7/book.aspx?ID=37816
ContextSwitchI
-HttpSendRespo
+PostQueuedCom
-PostQueuedCom
-GetQueuedComp
+WaitForSingle
-WaitForSingle
IndyProfResp
+WaitForSingle
-WaitForSingle
+WaitForSingle
-WaitForSingle
+HttpSendRespo
-HttpSendRespo
+HttpReceiveHt
-HttpReceiveHt
+GetQueuedComp
+WaitForSingle
-WaitForSingle
HttpModEnd
+WaitForSingle
-WaitForSingle
+WaitForSingle
-WaitForSingle
+HttpSendRespo
ContextSwitchI
DiskRead
ContextSwitchI
DiskRead
ContextSwitchI
DiskRead
+WaitForSingle
-WaitForSingle
ContextSwitchI
ContextSwitchI
-WSARecv
SQL:EndProf
-WSARecv
SQL:StartProf
SQL:EndProf
SQL:StartProf
-WSARecv
Net RX
SQL:StartProf
ContextSwitchI
ContextSwitchI
18.3357530767s
-HttpSendRespo
+HttpReceiveHt
-HttpReceiveHt
+GetQueuedComp
-GetQueuedCom p
+GetQueuedComp
ContextSwitchI
-HttpSendRespo
+PostQueuedCom
-PostQueuedCom
-GetQueuedCom p
IndyProfResp
+HttpSendRespo
+HttpSendRespo
HttpModEnd
ContextSwitchI
ContextSwitchI
-WSARecv
+WSARecv
-send
+send
ContextSwitchI
DiskRead
-WaitForSingle
+SetEvent
-SetEvent
SQL:EndProf
DiskRead
-WaitForSingle
ContextSwitchI
+WaitForSingle
DiskRead
-WaitForSingle
ContextSwitchI
+WaitForSingle
+WaitForSingle
ContextSwitchI
-WaitForSingle
+WaitForSingle
-WaitForSingle
+send
-send
+WSARecv
+GetQueuedComp
-GetQueuedComp
+GetQueuedComp
10.0513824767s
-WSARecv
Transaction ccc001a6: / d u wamish7/ c ategories.aspx?ID=836
ContextSwitchI
-WSARecv
+WSARecv
-send
+send
ContextSwitchI
ContextSwitchI
ContextSwitchI
-WSARecv
+WSARecv
-send
+send
Disk
SQL:StartProf
+WaitForSingle
-WaitForSingle
-SetEvent
ContextSwitchI
+SetEvent
+WaitForSingle
-WaitForSingle
ContextSwitchI
-SetEvent
+SetEvent
ContextSwitchI
+WaitForSingle
Net RX
HttpModBegin
Disk
+GetQueuedComp
-GetQueuedCom p
+GetQueuedComp
!HttpReceiveHt
IndyProfReq
HttpModBegin
WEB.eec
SQL:EndProf
-WSARecv
SQL:StartProf
ContextSwitchI
ContextSwitchI
WEB.6dc
-GetQueuedCom p
!HttpReceiveHt
IndyProfReq
WEB.a18
-GetQueuedComp
!HttpReceiveHt
+WaitForSingle
-WaitForSingle
IndyProfReq
+WaitForSingle
-WaitForSingle
+WaitForSingle
-WaitForSingle
+SetEvent
-SetEvent
ContextSwitchI
HttpModBegin
+WaitForSingle
-WaitForSingle
+WaitForSingle
-WaitForSingle
“An assortment of magical tastes”…
Transaction ccc000b9: / d u wamish7 /categories.aspx?ID=831
WEB.398
Disk
Net RX
WEB.848
Net TX
Disk
10.1559640767s
Net RX
Net TX
Net TX
Net TX
Net RX
Disk
22.7645571867s
22.8266142967s
22.7645571867s
22.8266142967s
WEB.a18
Net TX
Net TX
22.8630055267s
Net TX
SQL.130
Net RX
Disk
22.5109738067s
22.5302513033s
22.8630055267s
22.5109738067s
22.5302513033s
WEB.6dc
Net TX
Net TX
18.4334278667s
Net TX
49.13234228s
Net TX
SQL.a34
49.13234228s
Disk
10.0513824767s
SQL:EndProf
-WSARecv
SQL:StartProf
SQL:EndProf
10.0513824767s
IndyProfResp
-HttpSendRespo
+HttpSendRespo
SQL:EndProf
-WSARecv
SQL:StartProf
ContextSwitchI
ContextSwitchI
SQL:EndProf
-WSARecv
SQL:StartProf
ContextSwitchI
ContextSwitchI
ContextSwitchI
-HttpSendRespo
+HttpReceiveHt
-HttpReceiveHt
+GetQueuedComp
-GetQueuedCom p
+GetQueuedComp
ContextSwitchI
-HttpSendRespo
+PostQueuedCom
-PostQueuedCom
-GetQueuedCom p
IndyProfResp
+HttpSendRespo
+HttpSendRespo
HttpModEnd
ContextSwitchI
ContextSwitchI
-WSARecv
+WSARecv
-send
+send
ContextSwitchI
-WSARecv
+WSARecv
-send
+send
ContextSwitchI
ContextSwitchI
ContextSwitchI
-WSARecv
+WSARecv
-send
+send
HttpModBegin
Transaction ccc001a6: / d u wamish7/ c ategories.aspx?ID=836
HttpModEnd
-WSARecv
+send
-send
+WSARecv
-WSARecv
SQL:EndProf
-WSARecv
SQL:StartProf
ContextSwitchI
ContextSwitchI
Disk
+send
-send
+WSARecv
-WSARecv
Net RX
-WSARecv
SQL:StartProf
SQL:EndProf
+send
-send
+WSARecv
+GetQueuedComp
-GetQueuedCom p
+GetQueuedComp
-GetQueuedCom p
!HttpReceiveHt
IndyProfReq
WEB.6dc
SQL:StartProf
-WSARecv
WEB.eec
HttpModBegin
!HttpReceiveHt
IndyProfReq
How similar are two requests?
WEB.474
Disk
Net RX
Net TX
18.3357530767s
18.4334278667s
Net TX
SQL.9d8
18.3357530767s
Transaction ccc000b9: / d u wamish7 /categories.aspx?ID=831
18.4334278667s
WEB.398
Disk
Net RX
Net TX
10.1559640767s
Net TX
Net RX
SQL.9c4
10.1559640767s
Mining Some Structure…
Issues include:
•
•
•
•
•
Multiple clocks (at least one per machine)
Lots of concurrency
Only partial orders provided by network traffic
Noisy observations – preemptive scheduling
Aperiodic sampling – irregular “events”
Current approach…
• Borrowing algorithms from gene-sequence
comparison and speech recognition
• Construct a “string” representation of traces
• Cluster using variant of String Edit Distance
Levenshtein String Edit Distance
• d('', '') = 0
• d(s, '') = d('', s) = |s|
• d(s1+ch1, s2+ch2) =
min( d(s1, s2) + (ch1==ch2 ? 0 : 1),
d(s1+ch1, s2) + 1,
d(s1, s2+ch2) + 1 )
• Example:
appropriate m-eaning
|||||
|||||
|||
approximate matching
d(s1,s2)=7
• Can be computed in O(|s1|*|s2|) time using simple
dynamic programming algorithm
Partial ordering using packets…
Web
Server
HTTPREQ 46265 |
-GetQueuedCompletionStatus a18 |
!HttpReceiveHttpRequest a18 |
IndyProfReq a18 |
+GetQueuedCompletionStatus a18 |
CPU 4178433 Block |
-------------------------------|
HttpModBegin dc0 |
+ProcessRequestMain |
+OnStateChange |
-OnStateChange |
+FlushBuffer |
+send dc0 |
-------------------------------+------------------------------TDSREQ 23389 = 23389 TDSREQ
-send dc0 | Unblock
-FlushBuffer | SQL:StartProf 87c
+ReadNetlib | CPU 31605 Block
+WSARecv dc0 | Unblock
CPU 5038985 Block | CPU 31360 Block
ACK 23390 | Unblock
Unblock | +WaitForSingleObject 87c
CPU 13112 Block | -WaitForSingleObject 87c
| +SetEvent 87c
| CPU 33897 Block
| Unblock
| DiskRead 87c
| CPU 4020880
| +SetEvent 87c
| SQL:EndProf 87c
| CPU 182984
--------------------------------------------------------------TDSRESP 8819 = 8819 TDSRESP
Unblock |
-WSARecv dc0 |
-ReadNetlib |
+OnStateChange |
-OnStateChange |
SQL
Server
Example alphabet for trace strings…
A Distance Metric for Magpie Traces
1. Assign each Magpie instrumentation point a discrete label
2. Each trace entry has an 8-tuple of resource usage deltas
–
(Web CPU, Web DISK, WAN Rx, WAN Tx, LAN Rx, LAN Tx, SQL CPU, SQL DISK)
3. Deterministically flatten the partial order into a total order
4. Consider as a string of ‘weighted characters’, where weight
is the length of observation vector:
– e.g. !1 (0 {1 >5 [1 ]0 <1 }4 B0 b4 $0 )0 B0 b0 Q0 q0
5. Extend string-edit-distance to use normalised Euclidian
distance between between observation ‘vectors’
– Insert/delete cost = ||v||
“distance of point
from origin”
Substitution cost = ||v1-v2||
“distance between two
points in 8D space
Example String Edit Distances…
Trivial Clustering Algorithm…
• Doesn’t need to be very fancy… yet!
• Uses a ‘representative’ trace as cluster centroid
– Pick the best of 5 as in quicksort
• Compute distance from each trace to each
cluster centroid
– Add a “best-so-far” threshold to string-edit algorithm
– (dynamic programming algs are monotonic)
• Compare inter/intra-cluster mean distances to
decide when to create a new cluster
• Periodically move ‘singleton’ clusters to an
outliers list and try to merge back in at the end
• Approx O(N * C) - where C is #clusters
Typical clusters…
Centroid: !1(0R0r0)1B0b0Q0q0
0.000000
0.088417
0.003326
0.060546
0.013645
0.032970
0.048854
0.043589
0.106195
0.057646
0.025930
0.008057
...
static_ccc006c9
static_ccc00168
static_ccc00618
static_ccc006c3
static_ccc00616
static_ccc006c5
static_ccc006c4
static_ccc006c7
static_ccc0038e
static_ccc00043
static_ccc00556
static_ccc0038f
!1(0R0r0)1B0b0Q0q0
!1(0R0r0)1B0b0Q0q0
!1(0R0r0)1B0b0Q0q0
Centroid: !7(0{2>17[3]0<0}3B0b3$0)0B0b0Q0q0
0.000000 chkout_ccc0041c !7(0{2>18[3]0<0}3B0b3$0)0B0b0Q0q0
!1(0R0r0)1B0b0Q0q0
0.473224 chkout_ccc00092 !7(0{2>12[3]0<0}3B0b4$0)0B0b0Q0q0
!1(0R0r0)1B0b0Q0q0
0.273929 chkout_ccc00246 !7(0{2>16[3]0<0}3B0b4$0)1B0b0Q0q0
!1(0R0r0)0B0b0Q0q0
0.167314 chkout_ccc002a8 !7(0{2>17[3]0<0}3B0b3$0)0B0b0Q0q0
!1(0R0r0)1B0b0Q0q0
0.185258 chkout_ccc004be !7(0{2>17[3]0<0}3B0b3$0)0B0b0Q0q0
!1(0R1r0)1B0b0Q0q0
0.123095 chkout_ccc00675 !7(0{2>18[3]0<0}3B0b4$0)0B0b0Q0q0
0.100177 chkout_ccc005f4 !7(0{2>17[3]0<0}3B0b4$0)0B0b0Q0q0
!1(0R0r0)1B0b0Q0q0
0.318734 chkout_ccc001a5 !7(0{2>12[3]0<0}3B0b3$0)0B0b0Q0q0
!1(0R0r0)1B0b0Q0q0
0.049554 chkout_ccc00165 !7(0{2>17[3]0<0}3B0b3$0)0B0b0Q0q0
!1(0R0r0)1B0b0Q0q0
0.369109 chkout_ccc003ac !7(0{2>12[3]0<0}3B0b3$0)0B0b0Q0q0
!1(0R0r0)1B0b0Q0q0
...
Centroid: !1(0{2}5B0b4$0)0
0.000000
0.214118
0.204555
0.150912
0.019864
0.208676
0.212472
0.036158
0.210166
0.171975
0.288412
0.217915
0.238472
books_ccc00301
books_ccc00548
books_ccc006fd
books_ccc0079f
books_ccc00873
books_ccc00484
books_ccc0029f
books_ccc00842
books_ccc00517
books_ccc00589
books_ccc003d3
books_ccc0069c
books_ccc0076e
!1(0{2}5B0b4$0)0
!1(0{2}5B0b4$0)0
!1(0{2}5B0b4$0)0
!1(0{2}5B0b4$0)0
!1(0{2}5B0b4$0)0
!1(0{2}5B0b4$0)0
!1(0{2}5B0b4$0)0
!1(0{2}5B0b4$0)0
!1(0{2}5B0b4$0)0
!1(0{2}5B0b4$0)0
!1(0{4}5B0b4$0)0
!1(0{2}5B0b4$0)0
!1(0{2}5B0b3$0)0
Centroid: !1(0{1>4[1]0<1}4B0b3$0)0B0b0Q0q0
0.000000
0.396412
0.273409
0.069283
0.103720
0.044051
0.208350
0.147137
0.099524
0.244620
0.441445
0.116913
0.096342
...
books_ccc00748
logon_ccc005be
books_ccc005d4
books_ccc006e7
books_ccc00442
books_ccc00040
books_ccc000b9
books_ccc002cf
books_ccc0082b
books_ccc003ed
logon_ccc001e0
books_ccc00105
books_ccc00686
!1(0{1>5[1]0<1}4B0b4$0)0B0b0Q0q0
!2(0{2>5[1]0<0}4B0b3$0)0B0b0Q0q0
!1(0{1>4[1]0<1}4B0b5$0)1B0b0Q0q0
!1(0{1>4[1]0<1}4B0b4$0)0B0b0Q0q0
!1(0{1>4[1]0<1}4B0b3$0)0B0b0Q0q0
!1(0{1>4[1]0<1}4B0b3$0)0B0b0Q0q0
!1(0{1>4[1]0<1}5B0b4$0)0B0b0Q0q0
!1(0{1>4[1]0<1}4B0b4$0)0B0b0Q0q0
!1(0{1>4[1]0<1}4B0b4$0)0B0b0Q0q0
!1(0{1>7[1]0<1}5B0b4$0)0B0b0Q0q0
!2(0{2>5[1]0<0}4B1b3$0)0B0b0Q0q0
!1(0{1>4[1]0<1}4B0b4$0)0B0b0Q0q0
!1(0{1>4[1]0<1}4B0b4$0)0B0b0Q0q0
n2 pairwise distances
Transaction ID
Distance
Transaction ID
Visualisation of all Clusters
Distance
Transaction ID
Just the ‘active content’…
Transaction ID
A First-cut Workload Model…
• Just use cluster centroids and sizes to generate
‘similar’ transaction mix?
– Pretty good at capturing coarse differences
• e.g. Number of SQL requests
– Doesn’t deal with continuous distributions very well
• e.g. Cache/memory performance, zipf file size distributions
– But is it better than just using URL-based averages?
• Evaluation metric:
– Take the original trace, assume each transaction is replaced
by the centroid of its cluster and add up the RMS error.
– Evaluate with and without the ‘outlier’ clusters.
Evaluation of Magpie cluster-based model
• Results: per-resource RMS errors (across all transactions):
Method
Web CPU
URLs
Web DISK
WAN Rx
WAN Tx
LAN Rx
LAN Tx
SQL CPU
SQL DISK
48546337
1358
266
1958
127
339
12331844
206902
MAGPIE
7106121
1165
131
2053
1.3
37.7
5040474
11214
MAGPIE-O/L
7218741
1183
132
2080
1.32
38.34
5120613
11392
• RMS error improvement over just using URLs:
SQL DISK
SQL CPU
MAGPIE-O/L
MAGPIE
LAN Tx
URL
LAN Rx
WAN Tx
WAN Rx
Web DISK
Web CPU
0%
20%
40%
60%
80%
100%
120%
Better Models using Bayesian Learning?
Process Req
Blocked
S1
S2
Send Pkt
Web
S3
S4
Compute,
Disk IO
• Ongoing discussions with Michael Isard,
Mike Tipping and Chris Bishop
• Learn probabilistic models of resource
usage by different request types
• Construct the per-machine FSMs
• Possibly apply coupled hidden Markov
models (CHMMs)?
C
Tx
B
B
B
Rx
C
S2
S2
S3
S3
S3
S3
S4
S1
S1
S2
S2
S3
S1
S1
B
B
Rx
Tx
B
B
Receive Pkt
S1
S2
SQL
Waiting
S3
Send Pkt
(ignore fictitious details  )
C,D
time
Current Status…
• Using Matlab & Bayes Network Toolkit (BNT)
• Start by trying to fit simple HMM to just static
request cluster
– One discrete hidden state, 8 continuous observed
variables with assumed Gaussian distns.
– “Priors” computed from mean / var of observations
… unfortunately, none of the supplied learning algorithms
converge claiming our data is “infinitely improbable”!
Ongoing Work
• Investigating better ways of extracting models
from the data, esp. machine learning
• Transfer Magpie ideas/tools to the Indy team
• Use Magpie to learn parameters in the “live”
system order to calibrate processor, memory
system and cache models (more speculative)
• Exploring other types of distributed system,
e.g. GXA / Web Services v2  async messaging