ESX Performance Troubleshooting
VMware Technical Support
Broomfield, Colorado
Confidential
© 2009 VMware Inc. All rights reserved
What is slow performance?
•What does slow performance mean?
• Application responds slowly - latency
• Application takes longer time to do a job – throughput
Both are related to time
•Interpretation varies wildly
• Slower than expectation
• Throughput is low
• Latency is high
• Throughput, latency fine but uses excessive resources (efficiency)
•What are high latency, low throughput, and excessive resource usage?
• These are subjective and relative
Bandwidth, Throughput, Goodput, Latency
Bandwidth vs. Throughput
• Higher Bandwidth does not guarantee Throughput.
• Low Bandwidth is a bottleneck for higher Throughput
Throughput vs. Goodput
• Higher Throughput does not mean higher Goodput
• Low Throughput is indicative of lower Goodput
Efficiency = Goodput/Bandwidth
Throughput vs. Latency
• Low Latency does not guarantee higher Throughput and vice versa
• Throughput or Latency alone can dominate performance
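The definitions above can be condensed into a minimal sketch (the numbers are made up for illustration; only the relationships matter):

```python
# Efficiency is defined in the slides as Goodput / Bandwidth.

def efficiency(goodput_mbps: float, bandwidth_mbps: float) -> float:
    """Fraction of the raw link capacity that carries useful data."""
    return goodput_mbps / bandwidth_mbps

bandwidth = 1000.0   # raw link capacity, Mb/s
throughput = 800.0   # bits actually moved, Mb/s
goodput = 600.0      # useful bits after headers/retransmits, Mb/s

# Higher throughput does not imply higher goodput:
assert goodput < throughput <= bandwidth
print(f"efficiency = {efficiency(goodput, bandwidth):.2f}")  # 0.60
```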
Bandwidth, Throughput, Goodput, Latency
Bandwidth
Latency
Goodput
Throughput
How to measure performance?
Higher throughput does not necessarily mean higher
performance – Goodput could be low
Throughput is easy to measure, but Goodput is not
How do we measure performance?
• Performance itself is never measured directly
• We can only quantify the various metrics that affect performance.
These metrics describe the state of the CPU, memory, disk, and
network
Performance Metrics
CPU
• Throughput: MIPS (%used), Goodput: useful instructions
• Latency: Instruction Latency (cache latency, cache miss)
Memory
• Throughput: MB/Sec, Goodput: useful data
• Latency: nanosecs
Storage
• Throughput: MB/sec, IOPS, Goodput: useful data
• Latency: milliseconds (seek, rotational, and transfer time)
Networking
• Throughput: MB/Sec, IO/Sec, Goodput: useful traffic
• Latency: microseconds
Hardware and Performance
CPU
• Processor Architecture: Intel XEON, AMD Opteron
• Processor cache – L1, L2, L3, TLB
• Hyperthreading
• NUMA
Hardware and Performance
Processor Architecture
• Clock speeds from one architecture are not comparable with another
P-III outperforms P4 on a clock-by-clock basis
Opteron outperforms P4 on a clock-by-clock basis
• A higher clock speed is not always beneficial
A bigger cache or a better architecture may outperform higher clock speeds
• Processor-memory communication is often the performance
bottleneck
The processor wastes hundreds of instruction cycles while waiting on
memory access
Caching alleviates this issue
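A back-of-the-envelope model makes the point concrete (the cycle counts are illustrative assumptions, not measured figures): average memory access time (AMAT) collapses once most accesses hit the cache.

```python
# AMAT = hit_time + miss_rate * miss_penalty, the standard cache model.

def amat(hit_cycles: float, miss_rate: float, miss_penalty_cycles: float) -> float:
    """Average cycles per memory access."""
    return hit_cycles + miss_rate * miss_penalty_cycles

# Without a cache, every access pays the full memory latency (hundreds of cycles):
no_cache = amat(hit_cycles=0, miss_rate=1.0, miss_penalty_cycles=200)
# With a cache that hits 95% of the time at a 2-cycle hit latency:
with_cache = amat(hit_cycles=2, miss_rate=0.05, miss_penalty_cycles=200)

print(no_cache, with_cache)  # 200.0 12.0
```

A bigger cache lowers the miss rate but raises the hit latency, which is the trade-off the next slide describes.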
Hardware and Performance
Processor Cache
• Cache reduces memory access latency
• Bigger cache increases cache hit probability
• Why not build a bigger cache?
Expensive
Cache access latency increases with cache size
• Cache is built into stages – L1, L2, L3 with varying cache access
latency
• ESX benefits from larger cache sizes
• L3 cache seems to boost performance of networking workloads
Hardware and Performance
TLB – Translation Lookaside Buffer
• Every running process needs virtual address (VA) to physical
address (PA) translation
• Historically this translation was done entirely from page tables in memory
• Since memory access is significantly slower and a translation is
needed on every memory access, the TLB was introduced
• The TLB is hardware circuitry that caches VA-to-PA mappings
• When a VA is not present in the TLB, a TLB miss occurs and the
mapping must be loaded from the page tables in memory (load latency)
• Application performance depends on effective use of the TLB
• The TLB is flushed during a context switch
Hardware and performance
Hyperthreading
• Introduced with Pentium 4 and Xeon processors
• Allows simultaneous execution of two threads on a single processor
• HT maintains separate architectural states for the same processor
but shares underlying processor resources like execution unit,
cache etc
• HT strives to improve throughput by taking advantage of processor
stalls on the logical processor
• HT performance could be worse than uniprocessor (non-HT)
performance if the threads already have high cache hit rates (above
~50%), since the logical processors compete for the shared cache
Hardware and Performance
Multicores
• Each core has its own L1 cache
• The L2 cache is shared between cores
• Cache coherency is relatively faster compared to SMP systems
• Performance scaling is similar to SMP systems
Hardware and performance
NUMA
• Memory contention increases as the number of processors increases
• NUMA alleviates memory contention by localizing memory per
processor
Hardware and Performance - Memory
Node Interleaving
• Opteron processors support two types of memory access –
NUMA mode and node-interleaving mode
• Node-interleaving mode alternates memory pages between
processor nodes so that memory latencies are made uniform.
This can offer performance improvements to systems that are not
NUMA aware
• On single-core Opteron systems, each NUMA node contains only
one core
• An SMP VM on ESX running on a single-core Opteron system has
to access memory across the NUMA boundary, so SMP VMs may
benefit from node interleaving
• On dual-core Opteron systems a single NUMA node has two
cores, so NUMA mode can be turned on
Hardware and Performance – I/O devices
I/O Devices
• PCI-E, PCI-X, PCI
PCI at 66MHz – 533 MB/s
PCI-X at 133 MHz – 1066 MB/s
PCI-X at 266 MHz – 2133 MB/s
PCI-E bandwidth depends on the number of Lanes, x16 Lanes - 4GB/s,
each Lane adds 250 MB/s.
• PCI bus saturation – dual port, quad port devices
In PCI protocol the bus bandwidth is shared by all the devices in the bus.
Only one device could communicate at a time.
PCI-E allows parallel full duplex transmission with the use of Lanes
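The per-lane arithmetic above can be sketched as follows (PCIe 1.x era numbers, 250 MB/s per lane per direction, matching the slide):

```python
# PCIe bandwidth scales linearly with the number of lanes.

PCIE_LANE_MBPS = 250  # MB/s per lane, each direction (PCIe 1.x)

def pcie_bandwidth(lanes: int) -> int:
    """Aggregate one-direction bandwidth in MB/s for a given lane count."""
    return lanes * PCIE_LANE_MBPS

assert pcie_bandwidth(16) == 4000  # x16 ~ 4 GB/s, as quoted above
print({f"x{n}": pcie_bandwidth(n) for n in (1, 4, 8, 16)})
```

Unlike shared-bus PCI/PCI-X, each PCIe device gets its own lanes, so devices do not contend for this bandwidth.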
Hardware and Performance – I/O Devices
SCSI
• Ultra3/Ultra 160 SCSI – 160 MB/s
• Ultra320 SCSI – 320 MB/s
• SAS 3Gbps– 300 MB/s duplex
FC
• Speed constrained by Medium, Laser wavelength
• Link speeds (full duplex): 1G FC – 200 MB/s, 2G – 400 MB/s, 4G – 800 MB/s,
8G – 1600 MB/s
ESX Architecture
Performance Perspective
ESX Architecture – Performance Perspective
CPU Virtualization – Virtual Machine Monitor
• ESX doesn’t trap and emulate every instruction; the x86
architecture does not allow this
• System calls and Faults are trapped by the monitor
• Guest code runs in one of three contexts
Direct execution
Monitor code (fault handling)
Binary Translation (BT - non virtualizable instructions)
• BT behaves much like JIT
• Previously translated code fragments are stored in translation cache
and reused – saves translation overhead
ESX Architecture – Performance Implications
Virtual Machine Monitor – Performance implications
• Programs that don’t fault or invoke system calls run at near native
speeds – ex. Gzip
• Micro-benchmarks that do nothing but invoke system calls are
dominated by monitor overhead
• Translation overhead varies with different Privileged instructions.
Translation cache tries to offset some of the overhead.
• Applications will have varying amount of monitor overhead
depending on their call stack profile.
• Call stack profile of an application can vary depending on its
workload, errors and other factors.
• It is hard to generalize monitor overheads for any workload. Monitor
overheads measured for an application are strictly applicable only to
“Identical” test conditions.
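As a rough illustration of why overhead depends on the call-stack profile, a simple Amdahl-style model (the trap penalty factor is an assumption for illustration, not a VMware-published figure):

```python
# If a workload spends fraction f of its time in trapped operations
# (system calls, faults) and each trapped operation costs k times native,
# overall runtime scales by (1 - f) + f * k.

def virtualized_runtime(f_trapped: float, trap_penalty: float) -> float:
    """Runtime relative to native (1.0 = native speed)."""
    return (1.0 - f_trapped) + f_trapped * trap_penalty

# gzip-like workload: almost no syscalls -> near-native speed
print(virtualized_runtime(0.01, 10.0))  # ~1.09x native runtime
# syscall micro-benchmark: nearly all time trapped -> dominated by overhead
print(virtualized_runtime(0.95, 10.0))  # ~9.55x
```

This is why monitor overheads measured for one workload are strictly applicable only to identical test conditions: f changes with the workload.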
ESX Architecture – Performance Perspective
Memory virtualization
• Modern OS’es set up page tables for each running process. x86
paging hardware (TLB) caches VA - PA mappings
• Page table shadowing – additional level of indirection
VMM maintains PA – MA mappings in a shadow table
Allows the guest to use x86 paging hardware with the shadow table
• MMU updates
VMM write protects shadow page tables (trace)
When the guest updates its page table, the monitor kicks in (page fault) and
keeps the shadow page table consistent with the guest page table
• Hidden page faults
Trace faults are hidden from the guest OS - monitor overhead.
Hidden page faults are similar to TLB misses on native environments
ESX Architecture – Performance Perspective
Page table shadowing
ESX Architecture – Performance Implications
Context Switches
• On native hardware the TLB is flushed during a context switch. The newly
switched process incurs TLB misses on its first memory accesses.
• The VMM caches Page Table Entries (PTEs) across context switches
(caching MMU). We try to keep the shadow PTEs consistent with the
guest PTEs
• If there are lots of processes running in the guest and they context
switch frequently, the VMM may run out of page-table cache space.
Workload=terminalservices increases this cache size (vmx option).
Process creation
• Every newly created process requires new page-table mappings, so MMU
updates are frequent
• Shell scripts that spawn commands can cause MMU overhead
ESX Architecture – Performance Perspective
I/O Path
ESX Architecture – Performance Perspective
I/O Virtualization
• I/O devices are not virtualizable, and therefore they are emulated for
the guest OS
• VMkernel handles Storage and Networking devices directly as they
are performance critical in server environments. CDROM, floppy
devices are handled by the service console.
• I/O is interrupt driven and therefore incurs monitor overhead. All I/O
goes through VMkernel and involves a context switch from VMM to
VMKernel
• Networking devices have low latency, so the delay added by
context switches can hamper throughput
• The VMkernel fields I/O interrupts and delivers them to the correct VM. From
ESX 2.1, the VMkernel delivers interrupts to an idle processor.
ESX Architecture – Performance Perspective
Virtual Networking
• Virtual NICs
Queue buffer could overflow
- if the pkt tx/rx rate is high
- VM is not scheduled frequently
VMs are scheduled when they have packets for delivery
Idle VMs still receive broadcast frames. Wastes CPU resources.
Guest speed/duplex settings are irrelevant.
• Virtual switches don’t learn MAC addresses
VMs register their MAC addresses, so the virtual switch knows the location of each MAC
• VMnics
Listen for the MAC addresses registered by the VMs.
Layer 2 broadcast frames are passed up.
ESX Architecture – Performance Perspective
NIC Teaming
• Teaming only provides outbound load balancing
• NICs with different capabilities could be teamed
Least common Capability in the bond is used
• Out-MAC mode scales with number of VMs/virtual NICs. Traffic from
a single virtual NIC is never load balanced.
• Out-IP scales with the number of Unique TCP/IP sessions.
• Incoming traffic can come on the same NIC. Link aggregation on the
physical switches provides inbound load balancing.
• Packet reflections can cause performance hits in the guest OS. No
empirical data available.
• We fail back when the link comes alive again.
Performance could be affected if the link flip-flops.
ESX Architecture – Performance Perspective
vmxnet optimizations
• vmxnet handles cluster of packets at once – reduces context
switches and interrupts
• Clustering kicks in only when the packet receive/transmit rate is
high.
• vmxnet shares memory area with VMkernel – reduces copying
overhead
• vmxnet can take advantage of TCP checksum and Segmentation
offloading (TSO)
• NIC Morphing – allows loading the vmxnet driver for a vlance virtual
device. The driver probes a special register on the vlance device.
• Performance of a NIC-morphed vlance device is the same as the
performance of the vmxnet virtual device.
ESX Architecture – Performance Perspective
SCSI performance
• Queue depth determines the SCSI throughput. When the queue is
full, SCSI I/O’s are blocked limiting effective throughput.
• Stages of Queuing
Buslogic/LSILogic -> VMkernel queue -> VMkernel driver queue depth -> device firmware queue -> queue depth of the LUN
• Sched.numrequestOutstanding – number of outstanding I/O
commands per VM – see KB 1269
• Buslogic driver in windows limits the queue depth size to 1 – see KB
1890
• Registry settings available for maximizing queue depth for LSILogic
adapter (Maximum Number of Concurrent I/Os)
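Little's law (a standard queueing result, not ESX-specific) connects the queue-depth discussion to throughput: when the queue is full, effective throughput is capped at queue depth divided by per-I/O latency.

```python
# Little's law: outstanding I/Os = IOPS * latency, so
# max IOPS = queue_depth / latency when the queue is the bottleneck.

def max_iops(queue_depth: int, latency_s: float) -> float:
    """Upper bound on IOPS imposed by the queue depth."""
    return queue_depth / latency_s

# e.g. queue depth 32, 5 ms per I/O:
print(max_iops(32, 0.005))  # 6400.0
# A BusLogic queue depth limited to 1 (see KB 1890) caps the same device at:
print(max_iops(1, 0.005))   # 200.0
```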
ESX Architecture – Performance Perspective
VMFS
• Uses larger block sizes (1 MB default)
A larger block size reduces metadata size – metadata is completely cached
in memory
Near-native speeds are possible because the metadata overhead is removed
Fewer I/O operations. Improves read-ahead cache hits for sequential
reads
• Spanning
Data spills to the next LUN sequentially after the first LUN fills up. There is no
striping.
Does not offer performance improvements.
• Distributed Access
Multiple ESX hosts can access the VMFS volume, only one ESX host
updates the meta-data
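A rough illustration of why the 1 MB block size keeps metadata small enough to cache entirely in memory (the per-block metadata cost is left abstract; only block counts are compared):

```python
# Fewer, larger blocks -> proportionally less metadata to track and cache.

def n_blocks(volume_bytes: int, block_bytes: int) -> int:
    """Number of file-system blocks needed to cover the volume."""
    return volume_bytes // block_bytes

VOLUME = 500 * 2**30  # a 500 GB volume

print(n_blocks(VOLUME, 2**20))      # 512000 blocks at 1 MB
print(n_blocks(VOLUME, 4 * 2**10))  # 131072000 blocks at 4 KB: 256x the metadata
```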
ESX Architecture – Performance Perspective
VMFS
• Volume Locking
Metadata updates are performed through locking mechanism
SCSI reservation is used to lock the volume
Do not confuse this locking with the file level locks implemented in the
VMFS volume for different access modes
• SCSI reservation
SCSI reservation blocks all I/O operations until the lock is released by the
owner
SCSI reservation is held usually for a very short time and released as
soon as the update is performed
SCSI reservation conflict happens when SCSI reservation is attempted on
a volume that is already locked. This usually happens when multiple ESX
hosts contend for metadata updates
ESX Architecture – Performance Perspective
VMFS
• Contention for metadata updates
Redo log updates from multiple ESX hosts
Template deployment with redo log activity
Anything that changes/modifies file permission on every ESX host
• VMFS 3.0 uses new volume locking mechanism that significantly
reduces the number of SCSI reservations used
ESX Architecture – Performance Perspective
Service Console
• Service console can share Interrupt resources with VMkernel.
Shared interrupt lines reduce performance of I/O devices – KB 1290
• MKS is handled in the service console in ESX 2.x, and its
performance is determined by the resources available in the COS
• The default Min CPU allocated is 8% and may not be sufficient if
there are lots of VMs running
• Memory recommendations for the service console do not account
for memory that will be used by agents
• Scalability of VMs is limited by the COS in ESX 2.x. ESX 3.x avoids this
problem by using userworlds in the VMkernel.
Understanding ESX Resource
Management & Over-Commitment
ESX Resource Management
Scheduling
• Only one VCPU runs on a CPU at any time
• Scheduler tries to run the VM on the same CPU as much as possible
• The scheduler can move VMs to other processors when it has to meet the CPU
demands of the VM
Co-scheduling
• SMP VMs are co-scheduled, i.e. all the VCPUs run on their own
PCPUs/LCPUs simultaneously
• Co-scheduling facilitates synchronization/communication between
processors, like in the case of spinlock wait between CPUs
• The scheduler can run one VCPU without the others for a short period of time (1.5
ms)
• The guest could halt a co-scheduled CPU it is not using, but Windows
doesn’t seem to halt the CPU – this wastes CPU cycles
ESX Resource Management
NUMA Scheduling
• Scheduler tries to schedule the world within the same NUMA node
so that cross NUMA migrations are fewer
• If a VM’s memory pages are split between NUMA nodes, the
memory scheduler slowly migrates all the VM’s pages to the local
node. Over time the system becomes completely NUMA balanced.
• On NUMA architectures, CPU utilization per NUMA node gives a better
idea of CPU contention
• When interpreting %ready, factor in the CPU contention within the same
NUMA node.
ESX Resource Management
Hyperthreading
• Hyperthreading support was added in ESX 2.1, recommended
• Hyperthreading increases scheduler’s flexibility especially in the
case of running SMP VMs with UP VMs
• A VM scheduled on a LCPU is charged only half the “package
seconds”
• Scheduler tries to avoid scheduling a SMP VM onto the logical
CPUS of the same package
• A high-priority VM may be scheduled to a package with one of its
LCPUs halted – this prevents other running worlds from using the
same package
ESX Resource Management
HTSharing
• Controls hyperthreading behavior with individual VMs.
• htsharing=any
Virtual CPUs could be scheduled on any LCPUs. Most flexible option for the
scheduler.
• htsharing=none
Excludes sharing of LCPUs with other VMs. VM with this option gets a full package
or never gets scheduled.
Essentially this excludes the VM from using logical CPUs (useful for the security
paranoid). Use this option if an application in the VM is known to perform poorly with
HT.
• htsharing=internal
Applies to SMP VMs only. This is same as none, but allows sharing the same
package for the VCPUs of the same VM. Best of both worlds for SMP VMs.
For UP VMs this translates to none
ESX Resource Management
HT Quarantining
• ESX uses P4 Performance counters to constantly evaluate HT
performance of running worlds
• If a VM appears to interact badly with HT, the VM is automatically
placed into a quarantining mode (i.e. htsharing is set to none)
• If the bad events disappear, the VM is automatically pulled back
from quarantining mode
• Quarantining is completely transparent
ESX Resource Management
CPU affinity
• Defines a subset of LCPUs/PCPUs that a world could run on
• Useful to
partition server between departments
troubleshoot system reliability issues
For manually setting NUMA affinity in ESX 1.5.x
applications that benefit from cache affinity
• Caveats
Worlds that don’t have affinity can run on any CPU, so they have a better chance of
getting scheduled
Affinity reduces the scheduler’s ability to maintain fairness – min CPU guarantees
may not be possible under some circumstances
NUMA optimizations (page migrations) are excluded for VMs that have CPU affinity
(can enforce manual memory affinity)
SMP VMs should not be pinned to LCPUs
Disallows vMotion operations
ESX Resource Management
Proportional Shares
• Shares are used only when there is a resource contention
• Unused shares (shares of a halting/idling VM) are partitioned across
active VMs.
• In ESX 2.x shares operate on a flat namespace
• Changing shares of one world affects the effective CPU cycles
received by other running worlds.
• If VMs use a different share scale then shares for other worlds
should be changed to the same scale
ESX Resource Management
Minimum CPU
• Guarantees CPU resources when the VM requests them
• Unused resources are not wasted; they are given to other worlds that
require them.
• Setting min CPU to 100% (200% in case of SMP) ensures that the
VM is not bound by the CPU resource limits
• Using min CPU is favored over using CPU affinity or proportional
shares
• Admission control verifies if Min CPUs could be guaranteed when
the VM is powered on or VMotioned
ESX Resource Management
Demystifying “Ready” time
• A powered-on VM is either running, halted, or in the ready state
• Ready time signifies the time spent by a VM on the run queue waiting to be
scheduled
• Ready time accrues when more than one world wants to run at the same
time on the same CPU
PCPU, VCPU over-commitment with CPU intensive workloads
Scheduler constraints - CPU affinity settings
• Higher ready time increases response times and job completion time
• Total accrued ready time is not useful
A VM could have accrued ready time during its runtime without incurring performance
loss (for example, during boot)
• %ready = ready time accrual rate
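%ready as an accrual rate can be sketched like this (the sampling scheme is illustrative; the field names are not the actual esxtop/proc format):

```python
# %ready = ready time accrued during a sampling interval, as a percentage
# of that interval.

def pct_ready(ready_ms_t0: float, ready_ms_t1: float, interval_ms: float) -> float:
    """Rate at which cumulative ready time grew over the interval."""
    return 100.0 * (ready_ms_t1 - ready_ms_t0) / interval_ms

# A VM accrued 400 ms of ready time over a 5-second sampling interval:
print(pct_ready(12_000, 12_400, 5_000))  # 8.0
```

This is why the total accrued value is not useful on its own: only the rate during the problem window matters.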
ESX Resource Management
Demystifying “Ready” time
• There are no good/bad values for %ready.
Depends on the priority of the VMs - latency sensitive applications may
require less or no ready time
• Ready time could be reduced by increasing the priority of the VM
Allocate more shares, set minCPU, remove CPU affinity
ESX Resource Management
Unexplained “Ready” time
• If the VM accrues ready time while there are enough CPU resources
then it is called “Unexplained Ready time”
• There is some belief in the field that such a thing actually exists –
hard to prove or disprove
• Very hard to determine if CPU resources are available when ready
time accrues
CPU utilization is not a good indicator of CPU contention
Burstiness is very hard to determine
NUMA boundaries – All VMs may contend within the same NUMA node
Misunderstanding of how scheduler works
ESX Resource Management
Resource Management in ESX 3.0
• Resource Pools
Extends hierarchy. Shares operate within the resource pool domain.
• MHz
Resource allocations are absolute, based on clock cycles. Percentage-based
allocation could vary with processor speed.
• Clusters
Aggregates resources from multiple ESX hosts
Resource Over-Commitment
CPU Over-Commitment
• Scheduling
Too many things to do!
Symptoms: high %ready
Judicious use of SMP
• CPU utilization
Too much to do!
Symptoms: 100% CPU
Things to watch
- Misbehaving applications inside the guest
- Do not rely on Guest CPU utilization – halting issues, timer interrupts
- Some applications/services seem to impact guest halting behavior. No longer tied
to SMP HALs.
Resource Over-Commitment
CPU Over-Commitment
• Higher CPU utilization does not necessarily mean lower
performance.
The application’s progress is not affected by higher CPU utilization alone
However, if the higher CPU utilization is due to monitor overhead, it may
impact performance by increasing latency
When there is no headroom (100% CPU), performance degrades
• 100% CPU utilization and %ready are almost identical – both delay
application progress
• CPU Over-Commitment could lead to other performance problems
Dropped network packets
Poor I/O throughput
Higher latency, poor response time
Resource Over-Commitment
Memory Over-Commitment
• Guest Swapping - Warning
The guest page faults while swapping.
Performance is affected both by guest swapping itself and by the monitor overhead
of handling the page faults.
Additional disk I/O
• Ballooning – Serious
• VMkernel Swapping - Critical
• COS Swapping - Critical
The VMX process could stall and affect the progress of the VM
The VMX process could be killed by the kernel when memory runs out
The COS requires additional CPU cycles for handling frequent page faults and disk I/O
• Memory shares determine the rate of ballooning/swapping
Resource Over-Commitment
Memory Over-Commitment
• Ballooning
Ballooning/swapping stalls processor, increases delay
Windows VMs touch all allocated memory pages during boot. Memory
pages touched by the guest can be reclaimed only by ballooning
Linux guests touch memory pages on demand. Ballooning kicks in only
when the guest is under real memory pressure
Ballooning could be avoided by using min=max
/proc/vmware/sched/mem
- size <> sizetgt indicates memory pressure
- mctl > mctlgt – ballooning out (giving away pages)
- mctl < mctlgt – ballooning in (taking in pages)
Memory shares affect ballooning rate
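The mctl/mctltgt comparisons above can be encoded directly (the labels follow the slide; the real /proc/vmware/sched/mem layout and field semantics may differ by release):

```python
# Classify balloon activity from the current balloon size (mctl) versus
# its target (mctlgt), using the slide's own labels.

def balloon_state(mctl: int, mctlgt: int) -> str:
    if mctl > mctlgt:
        return "ballooning out (giving away pages)"
    if mctl < mctlgt:
        return "ballooning in (taking in pages)"
    return "at target"

print(balloon_state(mctl=600, mctlgt=500))
```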
Resource Over-Commitment
Memory Over-Commitment
• VMKernel Swapping
Processor stalls due to VMkernel swapping are more expensive than
ballooning (due to disk I/O)
Do not confuse this with
- Swap reservation: swap is always reserved for the worst-case scenario;
if min <> max, reservation = max – min
- Total swapped pages: Only current swap I/O affects performance
/proc/vmware/sched/mem-verbose
- swpd – total pages swapped
- swapin, swapout – swap I/O activity
SCSI I/O delays during VMKernel I/O swapping could result in system
reliability issues
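The swap-reservation rule above, written out as arithmetic (sizes in MB are arbitrary example values):

```python
# Swap is reserved for the worst case whenever min < max.

def swap_reservation_mb(min_mb: int, max_mb: int) -> int:
    """Swap space reserved at power-on: max - min (zero when min == max)."""
    return max(max_mb - min_mb, 0)

print(swap_reservation_mb(min_mb=512, max_mb=2048))   # 1536
print(swap_reservation_mb(min_mb=2048, max_mb=2048))  # 0 (min == max)
```

Setting min equal to max therefore avoids both the reservation and the possibility of VMkernel swapping for that VM.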
Resource Over-Commitment
I/O bottlenecks
• PCI Bus saturation
• Target device saturation
Easy to saturate storage arrays if the topology is not designed correctly for load
distribution
• Packet drops
Effective throughput reduces
Retransmissions can cause congestion
Window size scales down in the case of TCP
• Latency affects throughput
TCP is very sensitive to Latency and packet drops
• Broadcast traffic
Multicast and broadcast traffic sent to all VMs.
• Keep an eye on Pkts/sec and IOPS and not just bandwidth consumption
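TCP's sensitivity to latency is standard bandwidth-delay-product math (not ESX-specific): a window smaller than bandwidth × RTT caps throughput regardless of link speed.

```python
# Max TCP throughput with a fixed window: window / RTT.

def tcp_max_throughput_mbps(window_bytes: int, rtt_s: float) -> float:
    """Throughput ceiling in Mb/s imposed by the window size."""
    return window_bytes * 8 / rtt_s / 1e6

# A classic 64 KB window over a 10 ms RTT:
print(tcp_max_throughput_mbps(65_536, 0.010))  # ~52.4 Mb/s, far below GigE
```

Added latency from scheduling delays or retransmissions shrinks this ceiling further, which is why latency, not raw bandwidth, is often the real bottleneck.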
ESX Performance
Application Performance issues
ESX Performance – Application Issues
Before we begin
• From the VM’s perspective, a running application is just an x86 workload.
• Any application performance tuning that makes the application run more
efficiently will help
• Application performance can vary between versions
New version could be more or less efficient
Tuning recommendations could change
• Application behavior could change based on its configuration
• Application performance tuning requires intimate knowledge of how the
application behaves
• Nobody at VMware specializes in application performance tuning
Vendors should optimize their software with the understanding that the hardware
resources could be shared with other operating systems.
TAP program
- SpringSource (unit of VMware) – Provides developer support for API scripting
ESX Performance – Application issues
Citrix
• Roughly 50-60% monitor overhead – takes 50-60% more CPU cycles than
on the native machine
• The maximum number of users limit is hit when the CPU is maxed out –
roughly 50% of users as would be seen on native environment with an
apples to apples comparison.
• Citrix Logon delays
This could happen even on native machines when roaming profiles are configured.
Refer Citrix and MS KB articles
Monitor overhead can introduce logon delays
• Workarounds
Disable com ports, workload=terminalservices, disable unused apps, scale
horizontally
• ESX 3.0 improves Citrix performance – roughly 70-80% of native
performance
ESX Performance – Application issues
Database performance
• Scales well with vSMP – recommended
Exceptions: Pervasive SQL – not optimized for SMP
• Two key parameters for database workloads
Response time
- Transaction logs
CPU utilization
• Understanding SQL performance is complex. Most enterprise
databases run some sort of query optimizer that changes the SQL
Engine parameters dynamically
Performance will vary with run time. Typically benchmarking is done after
priming the database
• Memory resource is key. SQL performance can vary a lot depending
on the available memory.
ESX Performance – Application Issues
Lotus Domino Server
• One of the better performing workloads. 80-90% of direct_exec
• CPU and I/O intensive
• Scalability issues – Not a good idea to run all domino servers on the
same ESX server.
ESX Performance – Application Issues
16-bit applications
• 16 bit applications on windows NT/2000 and above run in a
Sandboxed Virtual Machine
• 16 bit apps depend on segmentation – possible monitor overhead.
• Some 16-bit apps seem to spin in an idle loop instead of halting the CPU
This consumes excessive CPU cycles
• No performance studies done yet
No compelling application
ESX Performance – Application Issues
Netperf – throughput
• Max Throughput is bound by a variety of parameters
Available Bandwidth, TCP window size, available CPU cycles
• VM incurs additional CPU overhead for I/O
• CPU utilization for networking varies with
Socket buffer size, MTU – affects the number of I/O operations performed
Driver – vmxnet consumes fewer CPU cycles
Offloading features – depending on the driver settings and NIC
capabilities
• For most applications, throughput is not the bottleneck
Measuring throughput and improving it may not always resolve the
underlying performance issue
ESX Performance – Application Issues
Netperf – Latency
• Latency plays an important role for many applications
• Latency can increase
When there are too many VMs to schedule
VM is CPU bound
Packets are dropped and then re-transmitted
ESX Performance – Application Issues
Compiler Workloads
• MMU intensive: Lots of new processes created, context switched,
and destroyed.
• An SMP VM may hurt performance
Many compiler workloads are not optimized for SMP
Process threads could ping-pong between the VCPUs
• Workarounds:
Disable NPTL
Try UP (don’t forget to change the HAL)
Workload=terminalservices might help
ESX Performance Forensics
61
Confidential
ESX Performance Forensics
Troubleshooting Methodology
• Understand the problem.
Pay attention to all the symptoms
Pay less attention to subjective metrics.
• Know the mechanics of the application
Find how the application works
What resources it uses, and how it interacts with the rest of the system
• Identify the key bottleneck
Look for clues in the data and see if that could be related to the symptoms
Eliminate CPU, Disk I/O, Networking I/O, Memory bottlenecks by running
tests
• Running the right test is critical.
ESX Performance Forensics
Isolating memory bottlenecks
• Ballooning
• Swapping
• Guest MMU overheads
ESX Performance Forensics
Isolating Networking Bottlenecks
• Speed/Duplex settings
• Link state flapping
• NIC Saturation /Load balancing
• Packet drops
• Rx/Tx Queue Overflow
ESX Performance Forensics
Isolating Disk I/O bottlenecks
• Queue depth
• Path thrashing
• LUN thrashing
ESX Performance Forensics
Isolating CPU bottlenecks
• CPU utilization
• CPU scheduling contention
• Guest CPU usage
• Monitor Overhead
ESX Performance Forensics
Isolating Monitor overhead
• Procedures for release builds
Collect performance snapshots
• Monitor Components
ESX Performance Forensics
Collecting Performance Snapshots
• Duration
• Delay
• Proc nodes
• Running esxtop on performance snapshots
ESX Performance Forensics
Collecting Benchmarking numbers
• Client side benchmarks
• Running benchmarks inside the guest
ESX Performance
Troubleshooting - Summary
ESX Performance Troubleshooting - Summary
Key points
• Address real performance issues. Lots of time can be wasted spinning
wheels on theoretical benchmarking studies
• Real performance issues could be easily described by the end user who
uses the application
• There is no magical configuration parameter that will solve all performance
problems
• ESX performance problems are resolved by
Re-architecting the deployment
Tuning application
Applying workarounds to circumvent bad workloads
Moving to a newer version that addresses a known problem
• Understanding Architecture is the key
Understanding both ESX and application architecture is essential to resolve
performance problems
Questions?
Reference links
http://www.vmware.com/files/pdf/perf-vsphere-memory_management.pdf
http://www.vmware.com/resources/techresources/10041
http://www.vmware.com/resources/techresources/10054
http://www.vmware.com/resources/techresources/10066
http://www.vmware.com/files/pdf/perf-vsphere-cpu_scheduler.pdf
http://www.vmware.com/pdf/RVI_performance.pdf
http://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf
http://www.vmware.com/files/pdf/perf-vsphere-fault_tolerance.pdf