IBMAdmin2013_Pedisich_Buildingaproactivemonitoring
Building a Proactive Monitoring and Alerting System Using Native IBM Domino Tools
Andy Pedisich
Technotics
© 2013 Wellesley Information Services. All rights reserved.
Why Do This Session …
Many admins want to take advantage of native Notes monitoring
solutions, but they just don’t have the bandwidth to explore them
“Free time” is very rare these days
This jumpstart will show you:
How to collect stats
How to analyze stats
How to go behind the scenes
How to set up monitors, alerts
And how to capture just about any little event you are
interested in
And finally, how to configure and work with DDM
Let’s get started
1
What We’ll Cover …
Looking at the big picture of server monitoring
Understanding statistic generation
Designing an efficient and sensible collection infrastructure
Pulling useful information from statistical data
Using cluster stats to keep clusters reliable
Understanding the essentials of event monitoring
Determining the best notification methods
DDM: Understanding how it fits into your environment
DDM: Crafting a perfect DDM data collection hierarchy
DDM: Looking at DDM events and probes
Wrap-up
2
Driving Your Domino Servers
You can learn a lot about the importance of monitoring from
driving your car
Your car tells you a lot about what’s going on
And you know they’re important because you pay attention
to the indicators
You fill the gas tank when it’s low
Unless you are Rob Axelrod (ask Rob)
And (usually) pay attention to the speedometer so you won’t
get a ticket
Or maybe you’re the driver who thinks that red light on the
dash is just for ambience while you’re driving at night
Uh oh
3
Domino Servers Are Obsessed with Statistics
Domino servers are constantly spewing stats
Just like your car telling you how fast you’re going
Except with Domino there are literally several hundred
statistics generated
Most of them are updated continuously
Many administrators don’t know which ones are important
Or how to tell the good readings from the bad ones
Or what to do about them when they are bad
4
The Truth About Monitoring
A good administrator shouldn’t have to look very hard
And you can be notified about most problems automatically
You can be proactive about fixing them
When you’re proactive, you put out fewer fires
Firefighting dilutes your effort
But being notified requires that you monitor your environment for
events and issues
And events depend on statistics
And statistics need to be collected
And too many sites don’t collect stats correctly
Some don’t collect them at all
5
Perpetual Statistics
Domino servers constantly generate statistics
They track data at a surprising level of detail
On almost every aspect of server operations
Agent manager
Mail and calendaring
The server’s platform
SMTP and Notes mail
LDAP
HTTP
Network
And lots more, too
7
Server Statistics Are Organized Hierarchically
Stats are gathered into major categories like these
And then each one has a multitude of subcategories
ADMINP, Agent, Calendar, Database, Disk, Domino, EVENT, HTTP, LDAP, Mail
Mem, Monitor, NET, Platform, POP3, Replica, Server, SMTP, Stats, Update
8
Subcategories of Statistics
Here’s a snapshot from the Administrator client showing some of
the statistical hierarchy
This gives you a snapshot of the stats on your server
Use Refresh to get another snapshot
9
Statistics Come in Basic Types
The basic types of statistics are:
Stats that never change once the server is started
Snapshot stats – reflect what’s going on right now
Cumulative stats that grow from the moment the server is
started
These stats are available to you for:
Your Domino servers
The platform your server is running on
Your network environment
10
Static Statistics
Statistics that don’t change usually represent the operating
environment of the server
Server.Version.Notes = Release 8.5.3FP3
Server.Version.OS = Windows NT 5.0
Server.CPU.Type = Intel Pentium
Disk.D.Size = 71,847,784,448
Mem.PhysicalRAM = 527,433,728
11
Amazing Detail, Yours Free!
This includes OS platform, Domino version, RAM
Lots of information about disks in use
Platform.LogicalDisk.TotalNumofDisks = 3
Platform.LogicalDisk.2.AssignedName = E
Disk.C.Size = 80,023,715,840
And even Network Interface Card (NIC) information
Platform.Network.1.AdapterName = Intel[R] PRO_1000 MT Server Adapter
Platform.Network.2.AdapterName = Broadcom NetXtreme Gigabit Ethernet _2
Platform.Network.3.AdapterName = Broadcom NetXtreme Gigabit Ethernet
12
What Good Are These Static Stats?
Think these static stats aren’t helpful?
Guess again
They are extremely valuable
If you are collecting stats correctly from all your servers, you can
take a pretty detailed server inventory
Without leaving your desk
From servers all around the world, just by looking at the data
we’re going to collect in the Monitoring Results database
This database is also known by its filename: STATREP.NSF
13
Snapshot Statistics
Snapshot stats show what’s happening at the moment you
ask for them
They are changing all the time
Disk.E.Free = 18,679,414,784
Server.Users = 280
Mem.Free = 433,614,848
MAIL.Waiting = 250
The best part about this is that you get lots of Domino-related
stats you wouldn’t get by looking at the operating system’s
performance tools
14
Cumulative Stats
Some stats are cumulative
They start counting from zero when you start the server
Server.Trans.Total = 31,915
SMTP.MessagesProcessed = 966
Stats, like averages and maximums, are calculated from the
cumulative ones
Server.Users.Peak.Time = 02/21/2006 07:50:33 MST
Platform.Memory.PagesPerSec.Peak = 1,364.1
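Because cumulative stats only ever count up from server start, a rate has to be worked out from two samples taken some time apart. The following is a minimal Python sketch of that arithmetic, not something Domino provides; the function name and the second sample value are made up for illustration.

```python
def rate_per_minute(earlier, later, minutes_between):
    """Average per-minute rate between two samples of a cumulative statistic."""
    if minutes_between <= 0:
        raise ValueError("minutes_between must be positive")
    delta = later - earlier
    if delta < 0:
        # The counter went backwards, so the server was restarted or the
        # stat was reset; assume the later sample counts up from zero.
        delta = later
    return delta / minutes_between


# Server.Trans.Total read as 31,915 and (hypothetically) 46,915 an hour later
print(rate_per_minute(31915, 46915, 60))  # 250.0
```

Handling the negative difference matters in practice, since a server restart between two collections would otherwise show up as a large negative rate.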
15
Resetting Statistics
Some of these cumulative stats can be reset using the following
console command:
Set Statistics statisticname
You can’t use wildcards (*) with this argument!
Here’s an example of why you might want to reset a stat:
Set Stat Server.Trans.Total
Resets the Server.Trans.Total statistic to 0
You might want to reset this stat if:
You are starting to benchmark a new application
You are debugging an agent and want to see if it is more
efficient after changes to its design
16
Platform Stats, Too
Platform stats vary widely from OS to OS
Getting platform stats from within Notes has great value
Track Domino server performance on an OS level even if your
servers run on a variety of operating systems
For example, it’s very common to have a mix of AIX and
Wintel servers
In a few minutes, we’ll be discussing threshold tracking
You’ll be able to set notification thresholds universally from
within Notes to track these platform stats
17
Getting to Platform Statistics
Domino releases 6, 7, and 8 track platform stats automatically
In earlier versions, they had to be explicitly enabled and many
times were disabled due to problems with servers crashing
These problems are gone
To see all platform stats – enter this console command
Show stat platform
18
A Word About Platform Stats on Partitioned Servers
Domino collects platform stats that pertain to the whole system
Not to an individual partition
The only statistics that are specific to a partition are those that
reflect tasks, such as process statistics
One partition might run 10 tasks, while another partition runs 15
tasks
19
Confirming Stats with Other Tools
Be careful when trying to confirm platform statistics using other
performance monitoring tools
Because of differences in sampling intervals, you cannot use
the operating system’s own monitoring tools to confirm platform statistics exactly
There will be discrepancies between platform statistics and
those obtained …
Using Perfmon on Windows 2000
Or from system commands, such as these UNIX commands:
iostat, vmstat, netstat
20
See Server Statistics
Quickest way to see all server stats is to enter console command:
Show stat
Any place you can get to a console, you can access stats that can
tell you a lot about the current state of the server
A SHOW STAT command gives you every statistic the Domino
server has
Several hundred of them!
That’s really too many to deal with at once
21
Can I See That in a Smaller Size?
Get a better view of the stats showing just what you’re looking for
using the asterisk wildcard
You can ask directly for the top level of the hierarchy
Show stat server
That shows all of the stat hierarchy under “server”
22
You Might Want Only Part of the Data
To get a select list of just the stats under the top level requires the
use of wildcards in your console commands
If you only want Server.Users hierarchy, use the global “*”
Show stat server.users.*
23
Pushing the Wildcards
If you want a closer look, like just grabbing particular sub-levels
of stats, get clever with the wildcard
For example, use the following command to find out about mail
waiting
Show stat mail.wait*
MAIL.Waiting = 1
Mail.WaitingForDeliveryRetry = 1
MAIL.WaitingForDIR = 0
MAIL.WaitingForDNS = 0
MAIL.WaitingRecipients = 1
5 statistics found
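Console output like this is easy to turn into data if you capture it to a file or script. Here is a minimal Python sketch, assuming only the "Name = Value" layout shown above; the function name is hypothetical and the parsing rules are a guess at what is useful, not an official format description.

```python
def parse_show_stat(console_text):
    """Turn 'Name = Value' lines from SHOW STAT output into a dict."""
    stats = {}
    for line in console_text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skips lines such as "5 statistics found"
        name, value = name.strip(), value.strip()
        cleaned = value.replace(",", "")  # 18,679,414,784 -> 18679414784
        try:
            stats[name] = float(cleaned) if "." in cleaned else int(cleaned)
        except ValueError:
            stats[name] = value  # keep text values (versions, dates) as-is
    return stats


sample = """MAIL.Waiting = 1
MAIL.WaitingForDNS = 0
Disk.E.Free = 18,679,414,784
Server.Version.Notes = Release 8.5.3FP3
5 statistics found"""
for name, value in parse_show_stat(sample).items():
    print(name, value)
```

Stat names are kept exactly as the console prints them, including the mixed case of MAIL and Mail.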
24
Take It to the Next Level
Now that we know where the statistics are, it’s time to kick it up a
notch
Let’s set up a collection architecture
Some Notes shops do not collect server statistics at all!
How in the world can they:
Determine what is causing performance issues?
Plan for future growth?
Have a grip on whether their server platforms are configured
correctly?
Do they just make the stuff up and go with it?
26
The Two Things Needed
There are two things that are needed for statistics collection to
happen:
The Events4 database must have a Server Statistic Collection document
The Collect task must be running on the server that is
designated to collect the statistics
27
Details, Details, Details
Events4, the Monitoring Configuration database, needs a
Server Statistic Collection document for each server collecting stats
This database should replicate to every server in the domain
A server will know it is supposed to collect stats because of
this document
But it won’t automatically load the Collect server task
We have to make sure that happens
28
Server Statistics Collection Docs
Use a Server Statistic Collection doc to indicate the server that will collect stats
And the servers you want the stats collected from
29
Set the Statistics Collection Interval
Use the collection report interval on the Options tab to set up how
often statistics should be gathered
Generally, collecting once an hour is sufficient
If you are upgrading or changing the environment, it’s better to
collect every 30 minutes
Or even every 15 minutes, if you are trying to fix problems
30
A Single Document Looks Like Many in the View
This single document, with a multi-value field containing all the
servers, will look like it is multiple documents in the Events4
database
Make sure administrators know this, or they might delete
everything by mistake
Guess how I know this?
31
Centralize Your Domain’s Statistic Collection
Ideally, use just a few key servers to do the collection
You might even be able to get away with just one!
Your network topology will have a profound effect on which
servers you select
So will the load currently running on the servers
Avoid collecting stats over long, slow links
Be careful of WAN routes that are already packed with other
network traffic
32
Configure Key Collect Points
If you have offices in London and Tokyo, then pick a collection
server from each city
That server will collect stats from all servers in that region
Collect stats in a database created from the Monitoring Results
template
The databases don’t have to be called Statrep
Voilà! Centralized data at your fingertips
City | Collecting Server | Monitoring Results Database
London | LonAdmin1 | LondonStatrep.nsf
Tokyo | TokHub01 | TokyoStatrep.nsf
33
Remember to Add the Collect Task
The Collect server task must be running on the servers you
selected as collectors
Use LOAD COLLECT from the console to get it started
Add the Collect task to the ServerTasks= line in the selected
servers’ Notes.ini to make it permanent
Remove Collect from the ServerTasks= line on all other servers!
Want the servers to start collecting stats immediately?
Use the following console command:
Tell Collector Collect
It will kick off a statistic collection of all the servers
you specified
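If you have many servers to tidy up, the ServerTasks= edit can be scripted. The following is a rough Python sketch that works on the text of a Notes.ini and nothing else; the function name is made up, and it assumes the usual comma-separated ServerTasks= line. Test it on a copy before going near a production file.

```python
def set_collect_task(ini_text, is_collector):
    """Return Notes.ini text with Collect present only if is_collector is True."""
    out = []
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "servertasks":
            tasks = [t.strip() for t in value.split(",") if t.strip()]
            # Drop any existing Collect entry, whatever its letter case
            tasks = [t for t in tasks if t.lower() != "collect"]
            if is_collector:
                tasks.append("Collect")
            line = "ServerTasks=" + ",".join(tasks)
        out.append(line)
    return "\n".join(out)


print(set_collect_task("ServerTasks=Update,Replica,Router", True))
# ServerTasks=Update,Replica,Router,Collect
print(set_collect_task("ServerTasks=Update,Collect,Router", False))
# ServerTasks=Update,Router
```

Lines such as ServerTasksAt1= are left alone because only the exact ServerTasks key is matched.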
34
The Collect Task Should Not Run on Every Server
Stat collection can be set up so each server collects its own stats
And puts them into a local Statrep Monitoring Results database
This method has the following drawbacks:
You have to run the Collect task on every server
You must visit Statrep on each server to analyze statistics
This is a real pain in the neck
And it makes analysis harder
Statistics have the most value when collected into a central
location where they can be easily analyzed
35
Demonstration: Setting Up the Collect Task
Demo
36
Let’s Start by Looking at Disk Stats
If I get a call about server performance, I check disk stats first
Bad disk utilization can seriously tank a server
One stat to track is Percent Utilization
A very busy disk can mean a very busy server
But it might mean something else is wrong
Perhaps a controller is beginning to fail or drive cache is wrong
Disk stat names depend on the platform, but have PctUtil in them
It could be Logical Disk or Physical Disk
Like Platform.LogicalDisk.1.PctUtil.Avg
This should rarely hit 60% on Wintel boxes
On AIX and iSeries, it depends on disk sub-systems config
They often can run 90%+ without issues
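The rule of thumb above can be expressed as a small check over collected stats. This is a Python sketch under stated assumptions: the 60% figure for Wintel comes from the slide, while the 90% figures for AIX and iSeries are placeholders only, since the right value depends on the disk subsystem configuration. The function and dictionary names are hypothetical.

```python
# Percent-utilization limits per platform. Only the Wintel value is from the
# session; the AIX and iSeries values are assumed placeholders to tune locally.
PCT_UTIL_LIMITS = {"wintel": 60.0, "aix": 90.0, "iseries": 90.0}


def busy_disks(stats, platform):
    """Return the names of PctUtil disk stats that exceed the platform limit."""
    limit = PCT_UTIL_LIMITS[platform]
    return sorted(
        name
        for name, value in stats.items()
        if "Disk." in name and "PctUtil" in name and value > limit
    )


stats = {
    "Platform.LogicalDisk.1.PctUtil.Avg": 72.5,
    "Platform.LogicalDisk.2.PctUtil.Avg": 12.0,
    "Server.Users": 280,
}
print(busy_disks(stats, "wintel"))  # ['Platform.LogicalDisk.1.PctUtil.Avg']
print(busy_disks(stats, "aix"))     # []
```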
38
Average Disk Queue Length
This is a major statistic!
Platform.LogicalDisk.1.AvgQueueLen, .Avg and .Peak
Queues of more than a couple of seconds mean your disks
can’t really keep up with the action
You can hit high peaks occasionally without issues
But constant highs mean moving users or apps
Balance these disk stats against CPU/Memory stats
Because memory = virtual disk
And constant thrashing of disks might mean you need more
RAM
Problem is, Statrep doesn’t have a view that shows these
important statistics
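The difference between an occasional peak and a constant high is easy to check once you have a series of collected readings. Here is a minimal Python sketch; the limit of 2 and the 80% fraction are assumptions chosen to illustrate "more than a couple" and "constant", not values from Domino or from the session.

```python
def queue_is_sustained(samples, limit=2.0, min_fraction=0.8):
    """True if most AvgQueueLen readings are over the limit.

    samples: readings of a disk AvgQueueLen stat from successive collections.
    limit and min_fraction are illustrative defaults, not official thresholds.
    """
    if not samples:
        return False
    over = sum(1 for reading in samples if reading > limit)
    return over / len(samples) >= min_fraction


print(queue_is_sustained([0.3, 9.0, 0.2, 0.4, 0.1]))  # False: one isolated peak
print(queue_is_sustained([3.1, 4.0, 2.5, 5.2, 2.2]))  # True: constantly high
```

Only the second pattern is the one that suggests moving users or applications, or looking at memory.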
39
There’s a Lot of Stuff That Isn’t There
Before we get any further, it’s important to point out something
that is hidden
Statistical data in the Monitoring Results database, STATREP.NSF
Statrep has views that simply don’t have data that is as useful as
it could be
It’s there, it’s just not in views
However, it’s important to know that every document in the
database contains every statistic you see when you issue a
SHOW STAT command at the console
It’s just a matter of showing it in a view
40
Take Home This View
But now you have a version of Statrep with a view that does
contain those important stats!
A specially-crafted version of the Statrep template with a view
like the one below is available
You can download it from my blog
You’ll probably have to modify the columns based on the disk
configurations of your own systems
41
Processor Statistics
Platform.Memory.RAM stats will disclose memory usage
Don’t just think you might need more memory: be certain by
checking this out
On Wintel systems, this number should rarely hit 60%
But on iSeries and AIX, it can be much higher
On iSeries it can actually run quite nicely at 90%
42
CPU Stats Are There for Each Task
Platform.Process.ActiveDomino.TotalCpuUtil
Gives you the big picture of how Domino is using processors
There is a Platform.Process.$$$.PctCpuUtil stat for each task you
run on your Domino servers
Platform.Process.Amgr.PctCpuUtil
Platform.Process.Router.PctCpuUtil
Platform.Process.Process.PctCpuUtil
… And so on
43
Using These Stats
You might find that the Agent Manager is the biggest hog because
of user personal agents!
You could move busy user agents to a different server
These stats don’t show in the Lotus version of Statrep
But they are on the Technotics85Statrep.NTF version
You can download it from my blog
www.andypedisich.com
44
Why Wouldn’t the Failover Replica Be Up to Date?
When primary server is down, users are directed to a replica on a
failover server
But sometimes that replica is not up to date
Cluster replication keeps primary server in sync with failover
It’s an event-driven process – occurs automatically when a
change is made to a database
Changes to a database are pushed to the replica on failover
Deletion stubs are not replicated
That’s why you also need a scheduled replication doc
between servers in a cluster
It’s vital that these replicas are synchronized
But by default, clusters only have 1 cluster replicator task
46
Not Now … I’m Too Busy
Occasionally, there is too much data changing to be replicated
efficiently by a single cluster replicator
If cluster replicators are too busy, replication is queued until
more resources are available
Your databases get out of sync and become stale
Adding a cluster replicator will help fix this problem
Use this parameter in the Notes.ini
CLUSTER_REPLICATORS=#
But how do you tell if there’s a potential problem?
Adding too many cluster replicators will have a negative effect
on server performance
47
Key Stats for Vital Information About Cluster Replication
Statistic | What It Tells You | Acceptable Values
Replica.Cluster.SecondsOnQueue | Total seconds that last DB replicated spent on work queue | < 15 sec – light load; > 30 sec – heavy
Replica.Cluster.SecondsOnQueue.Avg | Average seconds a DB spent on work queue | Use for trending
Replica.Cluster.SecondsOnQueue.Max | Maximum seconds a DB spent on work queue | Use for trending
Replica.Cluster.WorkQueueDepth | Current number of databases awaiting cluster replication | Usually zero
Replica.Cluster.WorkQueueDepth.Avg | Average work queue depth since the server started | Use for trending
Replica.Cluster.WorkQueueDepth.Max | Maximum work queue depth since the server started | Use for trending
48
What to Do About Stats Over the Limit
Acceptable Replica.Cluster.SecondsOnQueue
Queue is checked every 15 seconds
Under light load, should be less than 15 seconds
Under heavy load, if the number is larger than 30, another
cluster replicator should be added
If the above statistic is low, and Replica.Cluster.WorkQueueDepth
is constantly higher than 10
Perhaps your network bandwidth is too low
Consider setting up a private LAN for cluster replication
traffic
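The guidance on this slide boils down to a short decision rule. The Python sketch below is one reading of it, with assumptions: "low" is taken to mean under 15 seconds, and the caller is expected to pass values that have been consistently observed rather than a single reading. The function name and the returned strings are invented for illustration.

```python
def cluster_advice(seconds_on_queue, work_queue_depth):
    """Suggest an action from the two key cluster replication stats."""
    if seconds_on_queue > 30:
        # Heavy load: databases wait too long for the replicator
        return "add a cluster replicator"
    if seconds_on_queue < 15 and work_queue_depth > 10:
        # Queue time is fine but work piles up: look at the network
        return "check network bandwidth, consider a private cluster LAN"
    return "no action"


print(cluster_advice(45, 3))   # add a cluster replicator
print(cluster_advice(5, 12))   # check network bandwidth, consider a private cluster LAN
print(cluster_advice(5, 0))    # no action
```

Remember the earlier warning when acting on the first result: too many cluster replicators hurt server performance, so add them one at a time and watch the stats again.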
49
Stats That Have Meaning but Have Gone Missing
There aren’t any views in the Lotus version of Statrep that let you
see these important statistics
Matter of fact, the Clusters view is pretty worthless
50
Stats That Have Meaning but Have Gone Missing (cont.)
But there is a view like that in the Technotics85Statrep.ntf
It’s just a download from my blog
It shows the key stats you need
To help track and adjust your clusters
51
My Column Additions to Statrep
This slide explains the formulas I used in the view
The important thing is that I convert seconds to minutes
You are shown the major delays
Column Title | Formula | Formatting
Min on Q | Replica.Cluster.SecondsOnQueue / 60 | Fixed (One Decimal Place)
Min/Q Av | Replica.Cluster.SecondsOnQueue.Avg / 60 | Fixed (One Decimal Place)
Min/Q Mx | Replica.Cluster.SecondsOnQueue.Max / 60 | Fixed (One Decimal Place)
WkrDpth | Replica.Cluster.WorkQueueDepth | General
WD Av | Replica.Cluster.WorkQueueDepth.Avg | General
WD Mx | Replica.Cluster.WorkQueueDepth.Max | General
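If you export the stats and want the same numbers outside Notes, the first three columns are just a division and a fixed one-decimal format. Here is a tiny Python equivalent, offered as a sketch; the function name is hypothetical and this is not the view's actual column formula.

```python
def minutes_on_queue(seconds):
    """Convert a SecondsOnQueue value to minutes, fixed to one decimal place."""
    return f"{seconds / 60:.1f}"


print(minutes_on_queue(90))   # 1.5
print(minutes_on_queue(754))  # 12.6
```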
52
Demonstration: Looking at Technotics85Statrep.ntf
Demo
53
Event Monitoring Depends on Events4
We want to monitor all of our servers and be notified when certain
conditions occur
We will specify what we want to watch for and how to be
notified
We will use the Events4 database to configure all of this
The Events4 database must have the same replica ID on
every server in your domain
I have found many, many cases where Events4 was not the
same replica everywhere in the domain
That ruins the monitoring architecture
The monitoring configuration, and the config for alerts and
notifications, can’t replicate to some servers
55
We Know What the Replica ID Should Be for Events4
The replica IDs of system databases such as Events4 are derived
from the replica ID of the address book
Database | Replica ID
NAMES.NSF | 852564AC:004EBCCF
CATALOG.NSF | 852564AC:014EBCCF
EVENTS4.NSF | 852564AC:024EBCCF
ADMIN4.NSF | 852564AC:034EBCCF
Notice that the first two digits after the colon for the
Events4.nsf replica are 02
Determine your address book’s replica ID, and you’ll know
the replica ID of Events4
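The pattern in the table can be written down as a small helper. This Python sketch simply follows the table: it swaps the first two digits after the colon for the database's number. That substitution rule is an assumption drawn from the example IDs shown, and the function and dictionary names are made up; check the result against a server you trust.

```python
# Two-digit markers as they appear in the table above
SYSTEM_DB_DIGITS = {
    "names.nsf": "00",
    "catalog.nsf": "01",
    "events4.nsf": "02",
    "admin4.nsf": "03",
}


def expected_replica_id(names_replica_id, database):
    """Derive a system database's replica ID from the address book's ID."""
    first, sep, second = names_replica_id.partition(":")
    if not sep or len(first) != 8 or len(second) != 8:
        raise ValueError("expected a replica ID like 852564AC:004EBCCF")
    return f"{first}:{SYSTEM_DB_DIGITS[database.lower()]}{second[2:]}"


print(expected_replica_id("852564AC:004EBCCF", "EVENTS4.NSF"))
# 852564AC:024EBCCF
```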
56
Verify Events4
You must verify that every server has the same replica of Events4
You can find this info in the catalog, if your catalog architecture
is getting file info from all servers
Or you need to go to every server and open Events4
It’s vital that you validate Events4
If it is not right on a server, then you must down the server, delete
Events4, and restart your server
The correct Events4 will be re-created automatically
Make sure that EVENTS4.NSF is the same replica ID throughout
the domain by opening a copy from every server and putting it on
your desktop
Here’s some code to help you do that
57
Add a Button to Your Toolbar
Add this code to a button on your toolbar
This is courtesy of Thomas Bahn
He’s a smart guy, nice guy, and sometimes brings chocolates
to his friends from Europe
www.assono.de/blog
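REM {Use the Domain Directory on the mail server of the current user};
_names := @Subset(@MailDbName; 1) : "names.nsf";
REM {Pick one or more servers from the Servers view};
_servers := @PickList([Custom]; _names; "Servers"; "Select servers"; "Select servers to add database from"; 3);
REM {Ask for the database path, defaulting to log.nsf};
_db := @Prompt([OkCancelEdit]; "Enter database"; "Enter the file name and path of the database to add."; "log.nsf");
REM {Add the database icon from each selected server};
@For(n := 1; n <= @Elements(_servers); n := n + 1;
    @Command([AddDatabase]; _servers[n] : _db))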
58
Add a Database Icon from All Servers to the Desktop
This code will prompt you to pick the servers that have the
database you want on your desktop
Then it will prompt for the name of the database
And open it on all the servers you’ve selected
Use it to make sure all the EVENTS4.NSF are the same replica in
your domain
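Once you have noted the replica ID shown on each server, spotting the odd ones out is a simple comparison. The following Python sketch assumes you have typed or exported those IDs into a dictionary yourself; the function name, the NycMail02 server, and its replica ID are hypothetical.

```python
def mismatched_servers(replica_ids, expected_id):
    """Return the servers whose EVENTS4.NSF replica ID differs from expected_id.

    replica_ids maps server name -> replica ID seen on that server.
    """
    expected = expected_id.upper()
    return sorted(
        server
        for server, replica_id in replica_ids.items()
        if replica_id.upper() != expected
    )


seen = {
    "LonAdmin1": "852564AC:024EBCCF",
    "TokHub01": "852564ac:024ebccf",
    "NycMail02": "85257A10:0051D2E4",  # hypothetical server with a stray Events4
}
print(mismatched_servers(seen, "852564AC:024EBCCF"))  # ['NycMail02']
```

Any server in the result is a candidate for the fix described earlier: down the server, delete Events4, and restart.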
59
Now, on to Event Monitoring
Domino can monitor for just about any condition:
It can watch for a statistical threshold
Free disk space under a certain value
Mail.waiting over a certain value
It could be some non-statistical event in the log
An Agent that doesn’t have enough time to run
And might be in a loop
A corruption problem with a database that is preventing
replication
A user connecting with an unsupported version of Notes
60
What Happens Then?
When a certain statistical condition or log entry occurs, Domino
can do a bunch of different things
Capture and store the event in a database
Notify someone that the event happened
Log the event to a Tivoli console
And lots more that we will discuss in a few minutes
For now, let’s focus on capturing the events
61
Event Monitoring
Event monitors of all types are set in the Events4 database
Two broad categories of events:
Event Handlers
Specify the action that Domino takes when a specific event occurs
Event Generators
Each type of event generator has a view that provides a list of all event generators, plus additional configuration information
62
Event Generators
We’ll look at event generators first
They deal with specific Notes/Domino issues
There are six types of event generators:
Database Event Generator
Domino Server Response Event Generator
Mail Routing Event Generator
Statistic Event Generator
Task Status Event Generator
TCP Server Event Generator
Some are used more than others
We’ll stick to the more popular ones
63
Database Event Generator
Use Database Event Generators to monitor:
Database activity
Free space
Frequency and success of database replication
ACLs
And get reports on ACL changes
Including those made by replication or an API program
Monitor specific servers or every server in the domain
64
Here’s One That Everyone Should Use
The ACL of Names.nsf should be monitored for changes in every
Notes domain
Once properly set, the ACL of Names.nsf should rarely change!
All kinds of bells and whistles should go off when it does
Remember, we’ll talk notification in a moment
Here’s how to set up the monitoring of the ACL
Select New Database Event Generator
65
Here’s One That Everyone Should Use (cont.)
Select Names.nsf
You can choose either a single server, such as the administration server for the address book, OR
All servers in the domain
I like to pick all servers in the domain
Admins won’t get away with anything!
But I do get a storm of messages when an ACL change occurs
Every server tells me about the change
66
Monitoring Replication
Replication monitoring is somewhat useful
You can set a time interval in which you expect some
replication to occur
Just remember that it will report no replication occurred even
if there was nothing to replicate
This can be confusing since it might produce a report that
looks like an error occurred even though nothing is wrong
67
Other Database Event Generators
The unused space and user inactivity monitors might have value in very
specific situations
You can have it run compact, but who wants compact running
at any time of day?
And compact is generally run on a schedule anyway
You can be notified when a DB is not used
But activity logging is much better at this because it can deal
with all databases on all servers
68
Server Response Generator
Domino Server Response Event Generator
Checks connectivity/port status of server’s network
One server checks others by sending a probe
It’s a good idea to try opening Names.nsf
If you can’t open Names.nsf, then something is wrong!
Set interval for checking Names.nsf – default is 3 minutes
Set response time tolerance – Default is 1,000 Msecs (one second)
These will both depend on your own environment
69
More About Probes
The default response time is a bit on the harsh side
If left at one second, you’ll get lots of notifications
You should make it ten seconds or whatever the metrics in
your Service Level Agreement (SLA) require
Also, be careful what servers you choose to probe other servers
Try to pick probing servers that are in the same LAN as the
probed servers
Otherwise, your probing will be testing network latency
rather than the servers themselves
70
Mail Routing
Mail Routing Event Generator
Sends a mail-trace message to a particular user’s mail server
Gathers statistics indicating the amount of time, in seconds,
it takes to deliver the message
Great for troubleshooting
Generally not used day to day
71
Statistic Event Generators
Statistic Event Generators monitor a specific Domino or platform
statistic
They can let you know when a stat goes over a particular
threshold
These stat event generators are extremely valuable
Smart administrators use them every day!
72
Default Settings for Stats Event Generator
Many are set by default for all servers in the domain
Review these to see if they apply to your enterprise
73
Task Status
Task Status Event Generator is another interesting
troubleshooting tool
It monitors the status of the Domino server and add-in tasks
74
TCP Server Events
The TCP Server Event Generator verifies the availability of
Internet ports (TCP services) on servers
This also needs the ISpy task to work: add runjava ISpy to ServerTasks=, or start it from the console with
Load runjava ISpy
The task name is case sensitive!
A valuable concept for some servers
But not widely used
75
Checks the Ports for You
It generates a statistic indicating the amount of time, in
milliseconds, it takes to verify that the server is responding on the
specified port
Each port you select has a tab where you can sometimes set
special characteristics about the probe
76
Event Handlers — My Favorite!
We have worked pretty hard to get to this point:
Understood how statistics are generated
Identified stats important to the stability and performance
of servers
Set up a statistic collection infrastructure
Now we have the moment of truth
The event handlers!
77
Event Handlers — My Friends
An Event Handler defines the action that Domino takes when a
specific event occurs
Choosing the right action is critical to your organization
Some serious events should cause a page to be sent to the
person on call
Other events might merely cause an email to be sent
It all depends on what’s important to the business
We’ll talk about how you are notified in a moment
First, let’s review the awesome power of the Event Handler
78
Event Handling Options
Just like event generators, you can include all servers in the domain or just a few
This lets you target servers with “issues”
79
Getting Trigger Happy
The notification trigger is where it’s at:
Any event matching a criterion
A wide-open trigger for any problem, statistical or something
that just shows in the log
A built-in or add-in task event
Looks for an event generated by a Domino task
A custom event generator
An event generator that you created
80
Event Selection Criteria
You can select:
A certain type of event
Different types of severities
Or track a particular message that is appearing in the log
81
Demonstration: Working with Event Generators
Demo
82
Notification Method Selection
You can choose from a wide variety of notification methods
Some are better than others
And you can easily enable, disable, or select a time span for
notification
For example, page only certain numbers overnight but notify all
admins during the day
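The overnight-versus-daytime idea can be sketched in a few lines. This is only an illustration of the routing logic, not how Domino stores the setting; the admin names, the 08:00 to 18:00 window, and the `recipients` helper are all assumptions made up for the example.

```python
from datetime import time
from typing import List

ALL_ADMINS = ["admin-a", "admin-b", "admin-c"]  # hypothetical names
ON_CALL = ["admin-a"]                           # hypothetical overnight pager list

def recipients(now: time,
               day_start: time = time(8, 0),
               day_end: time = time(18, 0)) -> List[str]:
    """Everyone during the day window, only the on-call list otherwise."""
    if day_start <= now < day_end:
        return ALL_ADMINS
    return ON_CALL

print(recipients(time(10, 0)))   # daytime: all admins
print(recipients(time(2, 30)))   # overnight: on-call only
```

In a real Event Handler the same effect comes from enabling the handler only for a chosen time span, typically with one handler per window.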
84
Notification Options
Method | Result
Broadcast | Reports the event to all users logged onto the server or to a specified group of users
Log to Database | Logs the event to a database, typically STATREP.NSF
Mail | Mails the event to a person or to a mail-in database
Log to NT Event Viewer | Reports the event to the Windows NT Event Viewer
Pager | Uses the mail address of an alphanumeric pager
Relay | Relays the event to another server that is in the same Domino domain and that runs a common protocol
Run an Agent | Runs a specified agent based on the configured Event Handler. Use this method to resolve an issue without user intervention. You specify the agent name, the server and database containing the agent, and any parameter to pass to the agent.
85
Notification Options (cont.)
Method | Result
Run Program | Runs an add-in program or specified command to correct problems automatically
Send a Console Command to the Server | Sends a console command, or commands, to the server according to the Event Handler that was configured. You can specify the server console commands to run.
Sound | Sounds an alarm on the designated server when the event occurs
UNIXLog | Reports the event to the UNIX system log
86
These Two Are the Best Ones to Use
Method | Result | Comments
SNMP Trap | Sends the event as an SNMP trap. Select this method only if the specified server is running the Event Interceptor task and the Domino SNMP Agent. | This is truly an ideal notification method because it does not depend on Notes protocols actually working
Forward event to Tivoli Event Console | Allows the Tivoli Enterprise Console (TEC) to receive IBM Domino events and reformat them as TEC events. The reformatted TEC event is then sent to the TEC server that you specify in the Configuration Settings document. | Check with the Tivoli team to see if it’s possible to use this in your environment
87
Demonstration: Event Notification
Demo
88
Notification Methods Pros and Cons
•
•
Any notification method that involves Notes mail has limitations
If the Notes mail system is down, you won’t get notified
You especially won’t be notified that the mail system is down
or that a router has hung
Do not configure the server to notify you by email when mail is
backed up
The message that is being sent to you will be placed in
the queue
You won’t know about the problem until it is too late
Issue
89
Paging Dr. Howard, Dr. Fine, Dr. Howard …
•
•
A paging notification is a good choice
But not if you are paging through a third-party phone system
like Verizon or AT&T
They generally require an email to be sent (see
previous slide)
They have no Service Level Agreement – NONE!
Sadly, due to budget and resource constraints, these two methods,
mail and paging, are the ones we generally see used most in
production environments
Caution
90
The Best Notification Method … Also the Most Complicated
•
•
The best notification methodology is to go outside of Notes
protocols to SNMP or a similar external source
SNMP is Simple Network Management Protocol
There are SNMP agents that must be started on an OS level
They are different for every major platform
There are special considerations for partitioned servers
It is a complicated solution, but once in place, it has extreme
value
And you never have to rely on an application layer
solution again
91
Log or Relay No Matter What Else You Do
•
If an event is worth tracking, it should always be placed in a
STATREP.NSF
You can use the LOG option, which is used when you want to
capture the event in each server’s Statrep
You can use the RELAY option to send to a Statrep that is
centrally located
92
Some Tricks of the Trade
•
•
•
When problems occur, they are almost always recorded in the server log
That means you can catch them with an Event Handler
A great way to do this is to look for specific text in the message
That makes it very flexible
Log the results into a separate database to make analysis and
investigation easier
And you can create multiple events to take multiple
actions if necessary
Let’s look at a couple examples of this
93
A Good Example of Looking for Text
•
When someone enables Full Access Administrator, a message
shows up in the server log
You’ll definitely want to audit this when it occurs
If it’s in the log, that means you can grab it
94
A Good Example of Looking for Text (cont.)
•
If you wanted to be notified every time someone turns on Full
Access Administrator, you could look for the following string
“full administrator access”
Set up a notification to log to Statrep
And another notification to mail it to you so you always know
who is using this powerful privilege
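The "one text match, two notifications" pattern can be sketched outside of Domino like this. It is a toy model, not the Event Handler implementation: the `TextHandler` class, the sample log lines, and the case-insensitive matching are assumptions for illustration, and the two lists simply stand in for Statrep and your mail file.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TextHandler:
    """Toy stand-in for an Event Handler that watches for text in log messages."""
    match_text: str
    actions: List[Callable[[str], None]] = field(default_factory=list)

    def handle(self, log_line: str) -> bool:
        # Case-insensitive substring match (an assumption made for this sketch).
        if self.match_text.lower() not in log_line.lower():
            return False
        for action in self.actions:
            action(log_line)
        return True

statrep: List[str] = []  # stands in for "log to Statrep"
mailbox: List[str] = []  # stands in for "mail it to you"

handler = TextHandler(
    match_text="full administrator access",
    actions=[statrep.append, mailbox.append],
)

# Made-up log lines; real server log wording will differ.
sample_log = [
    "Router: no messages pending",
    "Jane Admin was granted full administrator access",
]
for line in sample_log:
    handler.handle(line)

print(len(statrep), len(mailbox))
```

In Domino itself you would get the same result by creating two Event Handlers with the same text criteria, one that logs and one that mails.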
95
Is Your SMTP Server Under Relay Attack?
•
If you’re interested in the safety of your SMTP server, you might
want to know when bad guys attempt to use it as an “open relay”
When that happens, you’ll see something like this in the
server log
Remember, if it’s in the log, that means you can grab it
96
Just Log It – But in a Special Database
•
In this case, you don’t really want to be notified right away
You just want to know when it happened
To make it easier to analyze, place the logged entries into a
separate database
Don't Forget
97
Why DDM Is Awesome
•
DDM is a single location where administrators can access issues
that are affecting multiple servers and databases
DDM database is a central repository of all monitoring data
Data collected by probes that you can configure
Result messages from event generators that you configured
99
Do All Administrators Use DDM?
•
•
•
Many administrators don’t use the potential of DDM as much as
they should
Already overwhelmed by the monitoring features of Domino
Don’t understand how DDM fits into the architecture
Some Administrators just have Probe-Aphobia
But you don’t have to use probes to use DDM
Probes are not a required part of DDM
They are nice to have and fun to use, but DDM functions
without them
You can get started without probes
Then add them into the configuration when you become more
familiar with how DDM works
100
The Big Relationships in Monitoring
•
EVENTS4.NSF – the Monitoring Configuration database is a key
file in your monitoring infrastructure
It also contains all of the specifics for your DDM monitoring
configuration
For DDM probes
For the DDM collection hierarchy – which must be set by you
101
We Know What the Replica ID Should Be for EVENTS4
•
•
The replica ID of system databases, such as EVENTS4 and
DDM.NSF, is derived from the replica ID of your
domain’s address book
Database | Replica ID
NAMES.NSF | 852564AC:004EBCCF
CATALOG.NSF | 852564AC:014EBCCF
EVENTS4.NSF | 852564AC:024EBCCF
ADMIN4.NSF | 852564AC:034EBCCF
DDM.NSF | 852564AC:0A4EBCCF
Notice the first two characters after the colon in each replica ID:
02 for EVENTS4.NSF and 0A for DDM.NSF
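The pattern in the replica ID table can be turned into a quick check. This is a minimal sketch that assumes the pattern holds exactly as shown in the table (only the two characters after the colon change); the `expected_replica_id` helper is a hypothetical name, and the prefixes come straight from the table.

```python
# Prefixes taken from the replica ID table; the helper name is hypothetical.
SYSTEM_DB_PREFIX = {
    "NAMES.NSF": "00",
    "CATALOG.NSF": "01",
    "EVENTS4.NSF": "02",
    "ADMIN4.NSF": "03",
    "DDM.NSF": "0A",
}

def expected_replica_id(names_replica_id: str, database: str) -> str:
    """Swap the first two characters after the colon for the database's prefix."""
    head, tail = names_replica_id.split(":")
    return f"{head}:{SYSTEM_DB_PREFIX[database.upper()]}{tail[2:]}"

names_id = "852564AC:004EBCCF"
for db in SYSTEM_DB_PREFIX:
    print(db, expected_replica_id(names_id, db))
```

Compare the output against the replica IDs actually found on each server to spot a DDM.NSF or EVENTS4.NSF that was created with the wrong ID.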
102
Errors You Might See If DDM.NSF Is Not Right
•
•
If there is a DDM.NSF on every server but they don’t all share the
same replica ID, you’ll see the following error on the console every
couple of minutes:
Unable to replicate with server Server2: None of the selected
databases have a replica on the server
You’ll get the error even if there is no connection document
You’ll get this error even if there is a connection document
and you have a much longer replication interval scheduled
To fix problems related to EVENTS4.NSF and DDM.NSF replica
IDs, you must delete the bad DDM databases and restart the
server
DDM.NSF will be recreated automatically
103
Configure DDM for Centralized Data Collection
•
•
•
DDM.NSF has the most value when it’s a centrally located repository
It will contain all of the issues that come from all of the servers
This does not happen on its own
There is no collection hierarchy set up by default
Each server collects its own DDM data in its own DDM.NSF
If your DDM hierarchy looks like below, you need to set it up
105
Collection Hierarchy Is a Must
•
•
Without a collection hierarchy, DDM probes run on a server and
report events to the DDM.NSF on that server
The events then remain only in that server’s replica of DDM.NSF
You have to check the DDM database on each server to evaluate
problems and discover potential issues
This is time consuming and is contrary to the design
It reduces time you could be spending solving problems
And it’s a big pain!
Which means you’ll never use it
106
Aggregate Data Centrally
•
•
A DDM server collection hierarchy lets you aggregate the data
onto a key server or servers
This must be configured in the EVENTS4.NSF
The simplest hierarchy is to configure one server to collect from
all servers in the domain
I totally recommend this to get you started
107
More Complex Scenarios Are Possible
•
Perhaps as you become more familiar with DDM, you’ll want to
roll up some data regionally
So that regional administrators receive only information that is
pertinent to the servers they maintain
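To compare a flat hierarchy with a regional one, it can help to model them on paper first. The sketch below is only a planning aid, not anything stored in EVENTS4.NSF; the server names and the `visible_servers` helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical server names; each collector maps to the servers it collects from.
flat = {"Hub1": ["Mail1", "Mail2", "App1", "App2"]}
regional = {
    "Hub1": ["HubEast", "HubWest"],
    "HubEast": ["Mail1", "App1"],
    "HubWest": ["Mail2", "App2"],
}

def visible_servers(hierarchy: Dict[str, List[str]], collector: str) -> List[str]:
    """All servers whose DDM data rolls up, directly or indirectly, to the collector."""
    seen: List[str] = []
    queue = list(hierarchy.get(collector, []))
    while queue:
        server = queue.pop(0)
        if server in seen:
            continue
        seen.append(server)
        queue.extend(hierarchy.get(server, []))
    return seen

print(visible_servers(flat, "Hub1"))
print(visible_servers(regional, "HubEast"))
print(visible_servers(regional, "Hub1"))
```

An admin working from HubEast sees only the eastern servers, while Hub1 still sees the whole domain.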
108
Rolling Up the Data
•
DDM data rollup propagates the probe results up the DDM server
collection hierarchy
Data rollup is accomplished using Domino’s selective
replication to transport the data
The replication formulas are created automatically when you
define your DDM server collection hierarchy
109
Hierarchy Collection Interval
•
•
The DDM system sets up its own collection interval
Collection replication occurs about every five minutes
This interval cannot be modified
It is not controlled through connection documents
Every five minutes, each collection server uses pull replication to
get updates from the DDM database on each monitored server
110
Address Issues by Severity Level
•
Looking at issues by severity gives you the chance to deal with
the most important issues first
They are broken out by severity category
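If you export events for reporting, the same "worst first" ordering is easy to reproduce. This is a sketch under assumptions: the severity names below are the ones commonly used for Domino events, but check them against your own views, and the sample events are made up.

```python
from typing import List, Tuple

# Assumed severity names, most urgent first; adjust to match your DDM views.
SEVERITY_ORDER = ["Fatal", "Failure", "Warning (High)", "Warning (Low)", "Normal"]

def by_severity(events: List[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
    """Sort (server, severity, message) tuples so the most urgent come first."""
    rank = {name: i for i, name in enumerate(SEVERITY_ORDER)}
    # Unknown severities sort last rather than raising an error.
    return sorted(events, key=lambda e: rank.get(e[1], len(SEVERITY_ORDER)))

# Made-up events for illustration only.
events = [
    ("Mail1", "Warning (Low)", "Agent ran longer than expected"),
    ("Hub1", "Fatal", "Database is corrupt"),
    ("App1", "Failure", "Replication failed"),
]
for server, severity, message in by_severity(events):
    print(f"{severity:15} {server:6} {message}")
```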
112
Another Helpful View
•
Release 8 added a new view to DDM.NSF
You can see issues by database name
This lets you determine whether a problem is happening
on just one server or on every copy of the database in
the domain
Very handy information when problem solving
113
Simplest Way to Use DDM
•
I consider the By Date view to be very helpful
Assign a junior Admin to check DDM events every day
Have the Admin go through all the events and fix problems
114
Working with a DDM Event
•
It’s a great monitoring tool because it smartly tells you
The task that reported it
The severity and type
And it would be pretty good even if it just did that
115
Working with a DDM Event (cont.)
•
It also suggests
Probable cause, possible solution
And very often offers a corrective action
Some of which are automated routines
116
Each DDM Event Has Common Actions
•
Such as
Open the server document or server log
View the server’s NOTES.INI
Open a remote console
Open the DB with the Designer client
And other actions depending on the error that occurred
117
DDM Has Powerful Probes
•
•
•
A probe is the investigative component of DDM
Probes:
Need configuration to be useful
Are configurable by administrators
A probe is an action configured to run against one or more
servers, databases, and services
A probe returns its status and results to the Domino Domain
Monitoring Database – DDM.NSF
118
Analysis Probes
•
•
Configure DDM using probe documents in EVENTS4.NSF
Otherwise known as the Monitoring Configuration database
You can create multiple probes for each feature area
And you can individually configure each probe to run:
Selective checks
Against specific servers and/or databases
At specific times
119
What’s in These Probe Documents?
•
Probe type and probe subtype
For example, Security is a probe type
One of its probe subtypes is Best Practices
This combination of probe type and probe subtype creates
a Security probe
120
Extra Information About the Probe Is Provided
•
These probe documents also contain a general description of the
probe, its purpose, and its intended use
121
Configuration Specifics, Too
•
Documents can also specify configurable probe targets
The server(s) that will run the probe
And in some cases, the servers, database, etc., that the probe
runs against
Where it’s applicable, there is also configurable scheduling
information – but not for all probe types
122
There’s More Inside
•
Probe documents can also hold configuration specifics
What the probe monitors
What it should report on
Thresholds to watch for
And what type of severity those thresholds represent
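The threshold-to-severity idea can be pictured as a small lookup. This is a sketch only; the numbers, the "mail pending" scenario, and the `severity_for` helper are hypothetical and are not taken from any shipped probe document.

```python
from typing import List, Tuple

# Hypothetical thresholds for a made-up "mail pending" probe: (minimum value, severity).
THRESHOLDS: List[Tuple[int, str]] = [
    (500, "Fatal"),
    (200, "Failure"),
    (50, "Warning (High)"),
]

def severity_for(value: int, thresholds: List[Tuple[int, str]] = THRESHOLDS) -> str:
    """Return the severity of the highest threshold the value has reached."""
    for minimum, severity in sorted(thresholds, reverse=True):
        if value >= minimum:
            return severity
    return "Normal"

for pending in (10, 75, 250, 900):
    print(pending, severity_for(pending))
```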
Plenty of Cool Probes
•
•
•
R8 gives us 58 default DDM probes to work with
R7 gives us 48 – still plenty to get us started
You can get probing as soon as R7/R8 is up
Just plug in your server info to get DDM started
You can also create new probe documents
Define and customize your own probes
124
Many Types of Probes
•
•
There are ten major types of probes in R8, nine in R7
These probes can run two different ways:
On a schedule that you specify
As an active monitor of things that happen in the domain
Some probes can run either way
On a schedule or as a monitor
It depends on what you ask them to do
Others can run only one way: as a monitor or on a schedule
125
Establishing a Schedule
•
Scheduled probes can be controlled with great granularity
Set the probe to run:
Daily, Weekly, Monthly
Beyond that, specific schedule settings can vary from
probe type to probe type
126
Don’t Worry About Getting Off Schedule
•
If a Weekly/Monthly probe is missed, you can specify how you
want the probe to be handled:
Ignore it completely
Run the missed probe on startup
Run the missed probe at the next time range
127
Zeroing in on Probes
•
We’re going to focus on two probes that have a high value in
almost every Domino domain:
Application probes
Security probes
128
Agents Are Tracked by Application Probes
•
•
•
•
Application probes monitor agents in real time
Agents behind schedule
Detects when an agent starts after its scheduled time
Long-running agents
Agents ranked by CPU usage
Evaluates the CPU usage for agents executed by Agent
Manager or HTTP
These have a relatively high overhead
129
Agents Are Tracked by Application Probes (cont.)
•
•
Agents ranked by memory usage
Evaluates the memory usage of agents executed by the Agent
Manager or HTTP tasks
Note that evaluation results for the same agent may differ depending
on whether the agent runs in Agent Manager or HTTP
Also, results from this probe can depend on HTTP settings
Long-running agents
Detects agents that run longer than a time you specify
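The long-running check boils down to comparing elapsed time against a limit you choose. A minimal sketch, assuming you already have each agent's start time; the agent names, times, and the `long_running` helper are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List

def long_running(started: Dict[str, datetime], now: datetime, limit: timedelta) -> List[str]:
    """Names of agents that have been running longer than the limit."""
    return sorted(name for name, start in started.items() if now - start > limit)

# Hypothetical agents and start times.
now = datetime(2013, 5, 1, 12, 0)
started = {
    "NightlyArchive": datetime(2013, 5, 1, 9, 0),
    "MailCleanup": datetime(2013, 5, 1, 11, 55),
}
print(long_running(started, now, timedelta(minutes=30)))
```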
130
The Five Security Check Probes
•
Security probes assess the overall security of servers and
databases in your domain
Best Practices
Compares a set of baseline security configuration settings to
the same settings in a domain
Configuration
Compares settings in a specific Server document to settings
in a specified “good” Server doc
This doc can be real or built by you as an example
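Conceptually, the Configuration probe is a field-by-field comparison against the "good" document. The sketch below shows that comparison on plain dictionaries; the field names and values are hypothetical, not real Server document items, and `config_drift` is a made-up helper.

```python
from typing import Dict, Tuple

def config_drift(good: Dict[str, str], actual: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {field: (expected, found)} for every field that differs from the 'good' doc."""
    return {
        field: (expected, actual.get(field, "<missing>"))
        for field, expected in good.items()
        if actual.get(field) != expected
    }

# Hypothetical field names and values; real Server document fields will differ.
good_doc = {"AllowAnonymous": "No", "FullAdmins": "LocalDomainAdmins", "CheckPasswords": "Enabled"}
server_doc = {"AllowAnonymous": "Yes", "FullAdmins": "LocalDomainAdmins"}
print(config_drift(good_doc, server_doc))
```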
131
The Five Security Check Probes (cont.)
•
•
Database ACL
Monitors the access control privileges that groups and
individuals have in specified databases
You designate the acceptable access levels on the
Specifics tab
Database Review
Reviews the security properties for a specified database
Generates a report on probe findings
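The Database ACL check amounts to comparing each entry's access level with the highest level you consider acceptable. A rough sketch, using the standard Domino access level names in ascending order; the ACL entries, the limits, the default of Reader, and the `acl_violations` helper are assumptions for illustration.

```python
from typing import Dict, List

# Domino ACL access levels, lowest to highest.
ACL_LEVELS = ["No Access", "Depositor", "Reader", "Author", "Editor", "Designer", "Manager"]

def acl_violations(acl: Dict[str, str],
                   max_allowed: Dict[str, str],
                   default_max: str = "Reader") -> List[str]:
    """List ACL entries whose level is above the acceptable level for that entry."""
    rank = {level: i for i, level in enumerate(ACL_LEVELS)}
    flagged = []
    for entry, level in acl.items():
        limit = max_allowed.get(entry, default_max)
        if rank[level] > rank[limit]:
            flagged.append(f"{entry}: {level} exceeds {limit}")
    return flagged

# Hypothetical ACL and limits for illustration.
acl = {"-Default-": "Editor", "LocalDomainAdmins": "Manager", "Anonymous": "No Access"}
limits = {"LocalDomainAdmins": "Manager", "-Default-": "Reader"}
print(acl_violations(acl, limits))
```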
132
The Five Security Check Probes (cont.)
•
Security Review
Generates a report on the security settings specified in the
Specifics tab of the probe document
You have the option of selecting the “Directory Profile Note”
and the “Security Settings in the Server Configuration
Document”
And a review of all security settings in a Server doc
This can really help to tighten your domain’s security
133
Using the Assign Button
•
You can assign the event to a team member and add comments
about the task using the “Assign” button
Or you can simply assign the event to yourself
134
Changes Are Tracked
•
All changes you make to the event are tracked in the Event
Change History for easy reference
Finally, there is an easy, built-in process for tracking problem
resolution in your environment
135
Demo
136
Some of My Other Statistics Sessions to Consider
•
•
•
It’s like an extension of this Jumpstart …
Advanced server monitoring and alert notifications
Friday morning at 9:45 am
Don’t miss my Hands-On Lab
Drilling Down into Domino Statistics
Wednesday 4:00 to 6:00 pm
Thursday 1:30 to 3:30 pm (right, it’s not am)
Session goes into detail about pulling statistical data into
spreadsheets for analysis with pivot tables and graphics
And you have the opportunity to get your hands dirty actually
making the graphics yourself
Hope to see you there!
138
Where to Find More Information
•
•
•
•
•
www-1.ibm.com/support/docview.wss?uid=swg27007060
“Lotus Education on Demand: Domino Domain Monitoring
(DDM)” (IBM, 2010).
www.ibm.com/developerworks/lotus/library/stats-linux/
Joe Malek, “Lotus Domino Platform Statistics on Linux”
(developerWorks, 2004).
www-1.ibm.com/support/docview.wss?uid=swg21139259
“Configuring Multiple Cluster Replicators on a Domino Server”
(IBM, 2011).
www-1.ibm.com/support/docview.wss?uid=swg21099635
“Which Domino Server Databases Have Replica IDs Related to
the NAMES.NSF?” (IBM, 2012).
www.andypedisich.com
Download presentations and technotics85Statrep.ntf
139
7 Key Points to Take Home
•
•
•
•
Run the Collect task on servers located centrally
Don’t run it on every server
Let cluster statistics be your guide in determining the number of
cluster replicators
A great technique for problem solving is to capture log entries
using Events4 and put them into a special Statrep for easy
examination
Make sure you have the correct replica of Events4 deployed and
that it’s the same replica ID on every server
140
7 Key Points to Take Home (cont.)
•
•
•
Be careful using server probes over a WAN, or you’ll end up
testing the network rather than the servers
Start with a flat DDM data collection hierarchy and make it more
complex only if your requirements call for it
Have new administrators check DDM every day and have them
assign problems they can’t fix to senior admins
141
Your Turn!
How to contact me:
Andy Pedisich
[email protected]
www.andypedisich.com
www.technotics.com
142