MSG347
Monitoring and Analyzing System
Performance for Exchange
Pierre Bijaoui
(Hewlett-Packard)
Goal: How To Pinpoint Causes Of
Poor Exchange Performance?
• Tools
• Windows Performance Monitor (Perfmon)
• Microsoft Operations Manager (MOM) +
Exchange Management Pack
• This talk is very detailed!
• Slides are available
• Don’t try to take detailed notes now
• Getting good at this analysis will take
practice
• Here’s a kick-start!
Format Note
• Performance Monitor counters will be in the
following format
Object(instance)\counter name
Object\counter name
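As an illustration of the two notations above, here is a minimal Python sketch that splits a counter path into object, optional instance, and counter name. The names `CounterPath` and `parse_counter` are hypothetical helpers made up for this example; they are not part of Perfmon.

```python
import re
from typing import NamedTuple, Optional


class CounterPath(NamedTuple):
    obj: str                 # e.g. "Processor"
    instance: Optional[str]  # e.g. "_Total", or None when there is no instance
    counter: str             # e.g. "% Processor Time"


# Object, then an optional "(instance)", then a backslash, then the counter name.
_PATTERN = re.compile(r"^(?P<obj>[^\\(]+)(?:\((?P<instance>[^)]*)\))?\\(?P<counter>.+)$")


def parse_counter(path: str) -> CounterPath:
    """Split 'Object(instance)\\counter' or 'Object\\counter' into its parts."""
    match = _PATTERN.match(path.strip())
    if match is None:
        raise ValueError(f"not a counter path: {path!r}")
    return CounterPath(match.group("obj"), match.group("instance"), match.group("counter"))


print(parse_counter(r"Processor(_Total)\% Processor Time"))
print(parse_counter(r"MSExchangeIS\RPC Requests"))
```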
Pinpointing
Performance Problems
What to do when clients say
their mail is slow…
• Basic process is deductive
• Start at top and eliminate possibilities
Question 1: Is The Problem
Exchange Or “Before” Exchange?
Are Requests Even Getting To Exchange?
• Use 2 counters
• MSExchangeIS\RPC Requests:
MAPI RPC requests currently being processed
• MSExchangeIS\RPC Operations/sec:
rate at which requests are being processed
• Problem is before Exchange if
• Operations/sec is low and
• Outstanding requests is zero
• All other combinations → problem is Exchange
or something after Exchange
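The decision rule above can be sketched in a few lines of Python. The slide does not say what counts as a "low" operations rate, so `low_ops_threshold` is an assumed placeholder to tune against your own baseline, and `locate_problem` is a hypothetical helper name.

```python
def locate_problem(rpc_requests: float,
                   rpc_ops_per_sec: float,
                   low_ops_threshold: float = 1.0) -> str:
    """Apply the two-counter rule to one sample of the MSExchangeIS counters.

    low_ops_threshold is an assumption: the slide only says "low".
    """
    if rpc_ops_per_sec <= low_ops_threshold and rpc_requests == 0:
        # Nothing queued and almost nothing executing: requests are not arriving.
        return "before Exchange"
    # Every other combination points at Exchange or something behind it.
    return "Exchange or after Exchange"


print(locate_problem(rpc_requests=0, rpc_ops_per_sec=0))
print(locate_problem(rpc_requests=25, rpc_ops_per_sec=0))
```

The second call mirrors the first example that follows: requests are outstanding but no operations execute, so the problem is in or after Exchange.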
Example
Exchange Problem
No operations are executing, but the store has
outstanding requests, for a 3-minute period in the middle
Example
Exchange Problem
Four periods of increasing outstanding requests
while throughput drops
Example
Client Problem


• Somebody running a utility or a test script?
• Use NetMon to find which machine the
requests are coming from
Example
A Network Problem

• Use NetMon to determine whether
requests are arriving at the server
Getting The Right Info Upfront
Questions about the problem
• Are clients experiencing sluggishness or
are clients hanging?
• Is it happening with a particular operation?
• Does everyone experience the problem at
the same time?
• At what frequency does this occur?
Getting The Right Info Upfront
Questions about the hardware
• How many CPUs on the server?
• How much memory on the server?
• For each physical disk volume
• how many disks
• how are they configured (RAID-0, 1 or 5)?
If The Problem Is On The Server…
First step: Is there a physical resource
bottleneck?
Questions
• Is there a CPU bottleneck?
• Is there a Disk bottleneck?
• Is there a memory bottleneck?
Is There A CPU Bottleneck?
• Easy to detect
• Processor(_Total)\% Processor Time
approaches 100%
• System\Processor Queue Length
above # of processors too often
• Caveat
• Full Text Indexing…(pause crawl)
• If CPU is high
• Is MSExchangeIS\RPC Requests increasing?
• Getting close to or above 30 is BAD and can
cause client timeouts
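A rough Python sketch of the CPU checks above, applied to a series of samples. The slide gives no numbers for "approaches 100%" or "too often", so `cpu_high` and `queue_fraction` are assumptions, as is treating either indicator alone as a flag. The function names are hypothetical.

```python
from typing import Sequence


def cpu_bottleneck(cpu_pct: Sequence[float],
                   queue_len: Sequence[float],
                   num_processors: int,
                   cpu_high: float = 95.0,       # assumed reading of "approaches 100%"
                   queue_fraction: float = 0.5   # assumed reading of "too often"
                   ) -> bool:
    """Flag a likely CPU bottleneck from sampled counter values."""
    if not cpu_pct or not queue_len:
        return False
    avg_cpu = sum(cpu_pct) / len(cpu_pct)
    # Share of samples where the processor queue exceeds the processor count.
    over = sum(1 for q in queue_len if q > num_processors) / len(queue_len)
    return avg_cpu >= cpu_high or over > queue_fraction


def rpc_requests_bad(rpc_requests: float, limit: float = 30) -> bool:
    """MSExchangeIS\\RPC Requests at or above 30 risks client timeouts."""
    return rpc_requests >= limit


print(cpu_bottleneck([98, 99, 97], [1, 2, 1], num_processors=4))
print(rpc_requests_bad(35))
```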
CPU Bottleneck
• Message Delivery spike leads to
CPU bottleneck
CPU ~ 100%
Who Is Consuming The CPU?
• The likely suspects (in order)
Process(store)\% Processor Time
Process(inetinfo)\% Processor Time
Process(emsmta)\% Processor Time
Process(mssearch)\% Processor Time
Process(mssdmn)\% Processor Time
Process(system)\% Processor Time
Total of these → 90% of the CPU used
Who Is Consuming the CPU?
“Histogram view”
Who Is Consuming The CPU?
• Likely sources of problems
• Backup utilities; AV/AS
• Monitoring utilities (WinMgmt, MAD)
• Remote access tools (WinVNC, TermSrv)
• Note
• Process counters → 100% = one full processor
• E.g., 8-proc server
0 < Process(process)\% Processor Time < 800%
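Because process counters scale with the processor count, it can help to normalize them before comparing against Processor(_Total). A minimal sketch; `share_of_server` is a hypothetical helper name.

```python
def share_of_server(process_pct: float, num_processors: int) -> float:
    """Convert a Process(...)\\% Processor Time value (100% = one processor)
    into a percentage of the whole server's CPU capacity."""
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    return process_pct / num_processors


# On an 8-processor server, a process reading of 400% is half the machine.
print(share_of_server(400, 8))  # 50.0
```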
Disk Bottleneck Detection



• Much fuzzier than CPU bottlenecks
→ present 3 approaches
• Always remember: A disk bottleneck may
actually be the symptom of a memory problem
Best Practice


• Size for disk I/O capacity first, instead
of disk space
• Run diskperf -y
→ enables logical and physical
disk counters
Disk Bottleneck Approach 1
PhysicalDisk(drive:)\Disk Writes/sec
PhysicalDisk(drive:)\Disk Reads/sec
→ Look at all drives – compare to total
→ Isolate where the I/O is going
→ Rule of thumb estimate for disk random I/O
RAID-0: Reads/s + Writes/s < # Spindles x 100
RAID-1: Reads/s + 2 * Writes/s < # Spindles x 100
RAID-5: Reads/s + 4 * Writes/s < # Spindles x 100
Assumes disk throughput = 100 random I/O per spindle
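The rule of thumb above is simple arithmetic, sketched here in Python. The write multipliers (1, 2, 4) and the 100 I/O per spindle figure come from the slide. The function names and the `raid` keys are made up for this example.

```python
# Back-end I/Os generated per host write, per the rule of thumb above.
WRITE_PENALTY = {"raid0": 1, "raid1": 2, "raid5": 4}


def disk_io_load(reads_per_sec: float, writes_per_sec: float, raid: str) -> float:
    """Spindle-level I/O rate implied by the measured read and write rates."""
    return reads_per_sec + WRITE_PENALTY[raid] * writes_per_sec


def within_capacity(reads_per_sec: float, writes_per_sec: float, raid: str,
                    spindles: int, io_per_spindle: float = 100) -> bool:
    """True when the load stays under spindles x 100 random I/O per second."""
    return disk_io_load(reads_per_sec, writes_per_sec, raid) < spindles * io_per_spindle


# 300 reads/s and 100 writes/s on a 6-spindle array:
print(disk_io_load(300, 100, "raid5"))        # 700
print(within_capacity(300, 100, "raid5", 6))  # False: 700 is not below 600
print(within_capacity(300, 100, "raid1", 6))  # True: 500 is below 600
```

The same workload fits on RAID-1 but not on RAID-5 with the same spindle count, which is why the slide recommends sizing for I/O capacity first.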
Disk Bottleneck Approach 2

• I/O requests waiting to be completed
PhysicalDisk(drive:)\Avg. Disk Queue Length
→ average over the sampling interval
PhysicalDisk(drive:)\Current Disk Queue Length
→ instantaneous value
• Disk bottleneck if
• Average queue >> number of spindles on the array
• Current Disk Queue Length never hits zero
• Correlate spikes with MSExchangeIS\RPC
Requests to confirm effect on clients
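The two queue conditions can be reported as separate signals, as in the sketch below. The slide writes ">>" without a number, so `factor` (how many times the spindle count counts as "much greater") is an assumption. `QueueSignals` and `disk_queue_signals` are hypothetical names.

```python
from typing import NamedTuple, Sequence


class QueueSignals(NamedTuple):
    avg_queue_high: bool  # average queue far above the spindle count
    never_idle: bool      # instantaneous queue never dropped to zero


def disk_queue_signals(avg_queue: float,
                       current_queue_samples: Sequence[float],
                       spindles: int,
                       factor: float = 2.0  # assumed meaning of ">>"
                       ) -> QueueSignals:
    """Evaluate both disk-queue conditions for one array."""
    avg_high = avg_queue > factor * spindles
    never_idle = len(current_queue_samples) > 0 and min(current_queue_samples) > 0
    return QueueSignals(avg_high, never_idle)


print(disk_queue_signals(30, [4, 9, 2, 7], spindles=6))
print(disk_queue_signals(3, [0, 2, 1, 0], spindles=6))
```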
Disk Bottleneck Approach 3
• I/O latency → sensitive to disk health
PhysicalDisk(drive:)\Avg. Disk sec/Read
PhysicalDisk(drive:)\Avg. Disk sec/Write
• Typical range: 0.005 to 0.020 seconds for
random I/O
• Write caching in array controller →
sec/write < 0.001
• Likely bottleneck: 0.020 - 0.050 seconds
• Definite bottleneck: > 0.050
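The latency bands above map directly onto a small classifier. The slide lists 0.020 seconds in both the typical and the likely-bottleneck ranges; this sketch assumes exactly 0.020 still counts as typical. The label for values under 0.005 is also an assumption, and `classify_latency` is a hypothetical name.

```python
def classify_latency(avg_disk_sec: float) -> str:
    """Classify an Avg. Disk sec/Read or sec/Write value, in seconds."""
    if avg_disk_sec > 0.050:
        return "definite bottleneck"
    if avg_disk_sec > 0.020:
        return "likely bottleneck"
    if avg_disk_sec >= 0.005:
        return "typical"
    # Faster than the typical random I/O range, e.g. writes absorbed by a
    # caching array controller.
    return "below typical range"


for sample in (0.0005, 0.010, 0.030, 0.080):
    print(sample, classify_latency(sample))
```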
What Is Causing The I/O?
• Identify drives with high I/O…
• May identify if it is likely to be the paging file,
.edb, .stm, .log, or routing queue files
• With Windows 2000 Server, you can use
Process(process name)\IO Read Operations/sec
Process(process name)\IO Write Operations/sec
→ qualitative feel for which process is doing I/O
Where Is The I/O Going?
Filemon
• Choose the logical disks that
need investigation
• Shows all disk reads and writes (size,
which file, etc.)
• Useful for multi-use disk (e.g. C:)
• See http://www.sysinternals.com
Filemon Example
Physical Memory
• Start with Memory\Available MBytes
• Available MBytes < 4MB
→ Windows aggressively cuts working sets
• Server clearly healthy if Available MBytes >> 4MB
• Check for paging problems with
• Memory\Pages/sec (total pages to/from disk)
• Memory\Page Reads/sec (total paging reads)
• Memory\Page Writes/sec (total paging writes)
• Paging I/O is normal → Exchange 2000 uses Windows NT
system cache for the .stm file
• Check that paging I/O is from the page file with physical
disk counters!
Monitoring Physical Memory
The Less-Useful Counters
• Memory\Page Faults/sec is often not an
indication of a problem as it includes
• Memory\Cache Faults/sec: normal part of Exchange
2000 operation because of .stm file
• Both “Page Faults” and “Cache Faults” include
• Memory\Transition Faults/sec: Faults that don’t go to
disk (memory manager has the pages on the standby
list)
• Process(process)\Page Faults/sec: Guide to find
rogue processes (use histogram trick)
Monitoring Memory
Where Did It Go?
Likely suspects
• Process(store)\Working Set
→ most of committed bytes
(due to Database\Cache Bytes)
• Process(inetinfo)\Working Set
• Process(emsmta)\Working Set
• Memory\Cache Bytes
→ Histogram to find processes with large
working sets…
Virtual Memory
A.k.a., Address Space
• Best Practice
Set the /3GB switch in Boot.ini for dedicated
Exchange 2000 servers with > 1 GB memory
• Requires Windows 2000 Adv. Server
or Datacenter
• Set /USERVA=3030 on Windows Server 2003
• Enterprise Edition and above
• Process(store)\Virtual Bytes: Want
>200MB free
• Note: 3 GBytes = 3.22 x 10^9 bytes
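The 200MB headroom check is a subtraction against the 3 GByte figure above. This sketch assumes "free" means the user address space minus Process(store)\Virtual Bytes, and that MB means 2^20 bytes. The function names are hypothetical, and `address_space` is a parameter because the usable space is slightly smaller when /USERVA=3030 is set.

```python
THREE_GB = 3 * 2**30  # 3,221,225,472 bytes, i.e. about 3.22 x 10^9


def free_address_space(store_virtual_bytes: int,
                       address_space: int = THREE_GB) -> int:
    """Address space left for the store process (assumed definition of "free")."""
    return address_space - store_virtual_bytes


def enough_headroom(store_virtual_bytes: int,
                    min_free: int = 200 * 2**20,
                    address_space: int = THREE_GB) -> bool:
    """True when more than 200MB of address space remains."""
    return free_address_space(store_virtual_bytes, address_space) > min_free


print(THREE_GB)                        # 3221225472
print(enough_headroom(2_900_000_000))  # True: about 306MB left
print(enough_headroom(3_100_000_000))  # False: about 116MB left
```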
Why is this important?
Virtual Memory
Fragmentation
[Figure: virtual memory fragmentation increasing over time, ending in very high fragmentation]
• Cluster failover may not work if
receiving node is highly fragmented!
• Need to monitor VM carefully…
Monitoring Virtual Memory
Exchange 2000 SP1 additions
• Perfmon counters to monitor VM fragmentation
(cluster failover)
• MSExchangeIS\VM Largest Block Size
• MSExchangeIS\VM Total Free Blocks
• MSExchangeIS\VM Total Large Free Block Bytes
• MSExchangeIS\VM Total 16MB Free Blocks
• MSExchangeIS events
• Event 9852 (warning and error severity)
→ warns of few large contiguous blocks of VM
Kernel Memory
• 32-bit OS limits kernel memory space
• Limits are computed at server startup
• Based on amount of physical memory and
number of processors
• /3GB switch limits kernel memory space
dramatically
Memory\Pool Paged Bytes
• Kernel memory space that can be paged
out to disk
• Max of 196 MB for a server with >1024 MB of
physical memory and /3GB switch
• 270 MB without /3GB switch set
• When max is hit, server → unresponsive
• Increasing paged pool bytes…indicative of
• Handle leaks → Check process handles counters
• Growing SMTP queue
Memory\Pool Nonpaged Bytes
• Kernel memory space that cannot be paged out
to disk
• Max of 96 MB on servers with more than 512 MB
with /3GB switch
• 250 MB without /3GB
• Increases are often indicative of
• Driver leak (SCSI etc)
• Excessive number of TCP/IP connections
• System will become unresponsive when it
reaches max
Memory: Free System Page Table
Entries (PTEs)
• Kernel memory space used to back I/O and
network buffers
• Generally 61k available PTEs on /3GB server with 1GB
physical RAM
• 450k without /3GB switch
• Healthy server if >5000
• Unhealthy server if <3000
• May drop network packets and/or disk I/Os
• Especially problematic on large, 8-processor servers with
thousands of users
• See Q313707 Exchange 2000 w. /3GB Switch Loses
Network Connectivity
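The two PTE thresholds above leave a gap between 3000 and 5000 that the slide does not name. The sketch below labels that band "borderline", which is an assumption, and `classify_free_ptes` is a hypothetical helper name.

```python
def classify_free_ptes(free_ptes: int) -> str:
    """Classify a Memory\\Free System Page Table Entries reading."""
    if free_ptes > 5000:
        return "healthy"
    if free_ptes < 3000:
        return "unhealthy"
    # Between the two thresholds: not named on the slide, assumed label.
    return "borderline"


for reading in (61_000, 4_200, 2_500):
    print(reading, classify_free_ptes(reading))
```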
Everything Checks Out But Server
Still ‘Slow’
• Exchange depends on the Active Directory
→ Check out bottlenecks on your
AD servers
• CPU bottleneck?
• Disk bottleneck?
• Insufficient Memory?
Most techniques discussed to identify
problems with Exchange 200x are equally
applicable to Windows 200x Active Directory
DSAccess Counters
Making Sure Caching Is Happening
• DSAccess reduces load on DS by caching
requests
• Important counters to check operation
• MSExchangeDSAccess Caches\Cache Hits/Sec
• MSExchangeDSAccess Caches\LDAP Searches/Sec
• Compare to baseline rates when server is
performing well
Problem Is
“Before” Exchange
• Check network counters
Network Interface(netcard)\bytes received/sec
Network Interface(netcard)\bytes sent/sec
The network is rarely a bottleneck.
However, incorrect backup schedules can cause
problems
• Next stop, client side sniffs – are the
packets really getting to the server?
Measuring Non-MAPI Requests
• Analog of “RPC requests”
→ Epoxy queue object counters
Epoxy(protocol)\Client Out Que Len
Epoxy(protocol)\Store Out Que Len
protocol = POP3, IMAP4, SMTP, DAV, and NNTP
• Client Out Que Len: Number of requests waiting
to be picked up by the store
• Store Out Que Len: Number of requests waiting
to be picked up by the Internet Information Server
protocol handlers
Message Delivery Counters
• Server responds to user
requests preferentially
• Delivery queues → first sign of an overload
• SMTP Server\Local Queue Length
• Should not grow continuously
• Peak periods: Growing and shrinking in the range of
0-1000 is reasonable
• SMTP Server\Messages Delivered/sec
• Should be continuous
• Gaps of zero delivery followed by spikes are
indicative of other bottlenecks
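Both delivery checks above can be run over logged samples, as in this sketch. Reading "grow continuously" as every sample exceeding the previous one, and requiring `min_gap_samples` consecutive zero samples before counting a gap, are both assumptions. The function names are hypothetical.

```python
from typing import Sequence


def grows_continuously(queue_lengths: Sequence[float]) -> bool:
    """True when every Local Queue Length sample exceeds the previous one."""
    return len(queue_lengths) >= 2 and all(
        later > earlier for earlier, later in zip(queue_lengths, queue_lengths[1:])
    )


def count_delivery_gaps(delivered_per_sec: Sequence[float],
                        min_gap_samples: int = 3) -> int:
    """Count runs of zero delivery that are followed by renewed delivery."""
    gaps = 0
    zero_run = 0
    for rate in delivered_per_sec:
        if rate == 0:
            zero_run += 1
        else:
            if zero_run >= min_gap_samples:
                gaps += 1  # the gap ended with delivery resuming
            zero_run = 0
    return gaps


print(grows_continuously([10, 50, 200, 900]))            # True
print(grows_continuously([100, 800, 300, 50]))           # False
print(count_delivery_gaps([5, 0, 0, 0, 40, 6, 0, 5]))    # 1
```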
Keeping Servers Healthy
• Monitor servers continuously!
• If you can identify bottlenecks, you can tell
• when you don’t have them and
• when you are getting close
• But only if you are monitoring!
• Need a baseline!
• E.g., is today’s problem due to
• Increased load
• Mail storm
• Virus
• Hardware problem
Monitoring Strategies
With Perfmon
• Keep live views w/different sample times,
e.g.,
• 900 seconds for a 24 hour view
• 1 second to catch short lived spikes
• Add minimal set of important counters
• Study your busiest server – why is it
different?
• Save reference logs (baseline data)
A Minimal Set Of Counters
1. Processor(_Total)\% Processor Time
2. System\Processor Queue Length
3. Process(store)\% Processor Time
4. PhysicalDisk(xxx)\Disk Transfers/sec
5. PhysicalDisk(xxx)\Avg. Disk sec/Transfer
6. MSExchangeIS\RPC Requests
7. MSExchangeIS\RPC Operations/sec
8. SMTP Server\Local Queue Length
9. SMTP Server\Messages Delivered/sec
10. MSExchangeIS Mailbox\Local Delivery Rate
11. MSExchangeIS Mailbox\Folder Opens/sec
12. MSExchangeIS Mailbox\Message Opens/sec
Do You Know?
• Number of messages received/user per day?
• How many do they download?
• How often do they open folders?
• What is the
• Peak delivery rate?
• Peak period during the day?
• Peak day of the week?
• Are there monthly/quarterly peaks?
• How many more users can your servers support?
Maybe there’s an easier way…
Making This Easier…
• Microsoft Operations Manager
and
• Exchange Management Pack
• Watch all of the bottleneck analysis perf
counters and much more
Goals Of The Exchange
Management Packs
• Facilitate high availability Exchange
operations
• Monitor broadly → maximum pre-emptive
alerting
• Facilitate lower time-to-resolution:
Management Pack knowledge base
• Rapid diagnosis
• Quick resolution
Questions
Exchange Survey
• Help us understand your requirements
• Available via CommsNet
• Daily Drawings for Windows Mobile
Smartphones!
• http://www.researchhq.com/messagingsurvey
Microsoft Learning
• Microsoft® Exchange Server 2003
Administrator's Companion ISBN: 0-7356-1979-4
Community Resources
• http://www.microsoft.com/communities/default.mspx
• Most Valuable Professional (MVP)
• http://mvp.support.microsoft.com/
• Newsgroups
• Converse online with Microsoft Newsgroups,
including Worldwide
• http://communities2.microsoft.com/communities/newsgroups/en-us/default.aspx
• User Groups - Meet and learn with your peers
• http://www.microsoft.com/communities/usergroups/default.mspx
© 2003 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.