Windows Monitoring


5 years of vaporware



These slides represent the work and opinions of the author and do not
constitute official positions of any organization sponsoring the author’s work.
This material has not been peer reviewed and is presented here as-is with the
permission of the author.
The author assumes no liability for any content or opinion expressed in this
presentation and/or use of content herein.

It is not their fault!
It is not my fault!
It is your fault!

Developer (not manager)
◦ Not working with Nagios

Accidentally ended up in our NOC
◦ Hated BB so we migrated to Nagios

2003: The birth of NSClient++
◦ NSClient sucked (Broke Exchange)
◦ NRPE_NT was too much work

2004: The open source of NSClient++
◦ “just for fun”

2007: The rebirth of NSClient++
◦ Got a lot of emails and hits on the webpage

2011: The Present
◦ 0.3.9 out last May
◦ 0.4.0 out as alpha

Windows Monitoring and NSClient++
◦ Quick Introduction

What’s new in 0.3.9
◦ Disk/File/*
◦ Scheduled Tasks
◦ Aliases
◦ Crash Handling

What’s new in 0.4.0
◦ New core
◦ Unix support
◦ New settings subsystem
◦ New protocol
◦ Python Scripting

The end of NSClient++!

Q/A
Quick Introduction

What is NSClient?
◦ A (pretty old) program
 pNSClient
◦ A (pretty limited) protocol
 check_nt
◦ A (pretty incorrect) concept
 “Windows monitoring”

What is it not?
◦ NSClient++!
 NSClient++ was written as a replacement for pNSClient
 But it has evolved much since then

NSClient++
◦ Freedom!

Custom scripts
Decentralized or centralized
Active or Passive
Can monitor “anything” (including your application)
Can perform “tasks” (fix your problems)
Other options:
◦ SNMP
 Generally complex to use and limited on “standard” hardware
◦ pNSClient/NRPE_NT/OpMonAgent/*
 Old, outdated and usually limited functionality
◦ “Agentless” WMI
 Limited functionality
 Enforces centralized and active monitoring

But...
◦ I am biased, so you might not want to take my word for it...
Protocol comparison (table flattened in this transcript; individual cell values cannot be matched to rows)
Columns: Protocol, Method, Encryption, Auth, Payload, M. args., M. cmds, Configuration, Commands, Extensible
Protocols listed: HTTP, NSClient, NRPE, NSCA, NRDP, NSCP, MQ, DNSCP, check_mk
Values shown: Method is Active or Passive; Payload is 1024, 512 or ∞; the remaining columns are Yes, No or ?

Internals:
◦ C++
◦ Around 75.000 lines of code
◦ Actively developed (unfortunately only by me)
◦ Modularized design (use what you need)

Runs on:
◦ Windows: NT4, w2k, XP, w2k3, Vista, w2k8, X64, X86 …
◦ Unix: Linux/Debian (probably many/most others as well)

Current Version:
◦ 0.3.9 with 0.4.0 in beta

Most features require NRPE or NSCA (or NSCP)

Documentation online (WIKI)
◦ http://nsclient.org

Not supported by a commercial entity
◦ Donations welcome
◦ Sponsoring available (contact me for details)

Used by a lot of people (I think)
◦ Impossible to estimate any figures

Please, Help out!
◦ Add documentation
◦ Report problems
◦ Share ideas, thoughts, etc…
Using NSClient++

NSClient++ is a command line program!
◦ nsclient++ -start (net start nsclientpp)
◦ nsclient++ -stop (net stop nsclientpp)
◦ nsclient++ -test

nsclient++ -test is your friend!

Configuration:
◦ notepad nsc.ini

Testing:
1. Local (nsclient++ -test)
2. From CLI (check_nrpe ...)
3. From Nagios (add command)

Works with “anything”
◦ Including many non Nagios based systems

New command line syntax!
◦ nscp --service --start
◦ nscp --service --stop
◦ nscp --help

Testing: nscp --test is your friend!
◦ nscp --test

Configuration:
◦ nscp --settings-help
◦ nscp --settings --migrate-to ini
◦ nscp --settings --set …
◦ …

Run scripts:
◦ nscp --client --module PythonScript --command execute-and-load-python --script test.py --install
Overview

Major simplification to the disk/file checker
◦ CheckFile (removed)
◦ CheckFile2 (deprecated)
◦ CheckFiles (replaces above)

Volume support (for real this time)
Aliases
NSCA/NRPE enhancements
Scheduled task checks
Crash Handling
A bunch of new commands
Bug fixes and many more things…



We have recruited a new member to the team!
A girl actually…
…Still a bit wet behind the ears…
CheckFile(1,2,s,…)

The good:
◦ Powerful interface!
◦ Simple to use!
◦ out-of-the-box solution!
 (on which you can expand)

The bad:
◦ Nothing! Really, I mean it!

…and then… yesterday…
◦ …in the bar…
◦ …all hopes shattered…
◦ …apparently it is still too complicated…


Same as was introduced for eventlog last year
Based on SQL WHERE clauses
◦ generated > -2d AND severity = 'error'
◦ size > 5k
◦ size > 5k OR size < 1k
◦ size > 5k AND written > -2d
◦ (size > 5k OR size < 1k ) AND written > -2d
◦ …
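
The filter expressions above read like SQL WHERE clauses over per-file attributes. As a rough illustration of the semantics (not NSClient++'s actual parser), the Python sketch below evaluates the equivalent of `size > 5k AND written > -2d` by hand. It assumes that `5k` means 5 × 1024 bytes and that `-2d` means "two days before now"; the helper names and the sample files are made up for this example.

```python
from datetime import datetime, timedelta

# Assumed unit multipliers; the slides do not spell these out.
SIZE_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
TIME_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_size(text):
    """Turn '5k' into a byte count (hypothetical helper)."""
    text = text.strip().lower()
    if text and text[-1] in SIZE_UNITS:
        return int(text[:-1]) * SIZE_UNITS[text[-1]]
    return int(text)


def parse_relative_time(text, now):
    """Turn '-2d' into an absolute timestamp relative to `now`."""
    text = text.strip().lower()
    amount, unit = int(text[:-1]), text[-1]
    return now + timedelta(**{TIME_UNITS[unit]: amount})


def matches(file_info, now):
    """Hand-written equivalent of: size > 5k AND written > -2d."""
    return (file_info["size"] > parse_size("5k")
            and file_info["written"] > parse_relative_time("-2d", now))


now = datetime(2011, 2, 10, 12, 0, 0)
files = [
    {"filename": "big_new.log", "size": 8000, "written": now - timedelta(hours=3)},
    {"filename": "big_old.log", "size": 8000, "written": now - timedelta(days=5)},
    {"filename": "small_new.log", "size": 100, "written": now - timedelta(hours=1)},
]
hits = [f["filename"] for f in files if matches(f, now)]
print(hits)
```

Only the file that is both larger than the size bound and written within the last two days counts as a hit; the number of hits is what the warn and crit options are then compared against.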
Type        Description
filename    Name of the file
path        Path of the file
size        Size of the file
accessed    When the file was last accessed
written     When the file was last written
creation    When the file was created
version     The exe file version (slow)
line_count  Number of lines in the file (slow)
Operator  Safe  Meaning
=         eq    Equality
!=        ne    Not equal
>         gt    Greater than
<         lt    Less than
=>        ge    Greater than or equal
=<        le    Less than or equal
like            String similarity (substring matching)
not like        Opposite of like
regexp          Regular expression matching
Option         Description
path           The root path to use
pattern        The file pattern to use
filter         Define the filter (there can only be one)
warn           How many hits constitute a warning state,
               e.g. warn=>5, warn==5, warn=!=5
crit           How many hits constitute a critical state
truncate       Length of returned data. Since NRPE/NSCA has a limited
               capacity this is important. (Will be deprecated in 0.4.0)
syntax         How to format the return data
master-syntax  How to format the “message string”
debug=true     Displays a lot more information in the logfile/console
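
The warn and crit options compare the number of filter hits against a bound. A minimal Python sketch of that mapping follows; it assumes that the text after the first `=` is an operator plus a number (so `warn=>5` carries the threshold `>5`) and that a critical match takes precedence over a warning match. The function names are hypothetical and this is not the NSClient++ implementation.

```python
import operator

# Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2

# "!=" is listed before "=" so that it is not mistaken for plain equality.
OPERATORS = [("!=", operator.ne), ("=", operator.eq),
             (">", operator.gt), ("<", operator.lt)]


def parse_threshold(expr):
    """Split a threshold such as '>5' into (comparison function, bound)."""
    for symbol, func in OPERATORS:
        if expr.startswith(symbol):
            return func, int(expr[len(symbol):])
    raise ValueError("unsupported threshold: %r" % expr)


def status_for_hits(hits, warn=None, crit=None):
    """Return the Nagios status for a hit count (critical checked first)."""
    if crit is not None:
        func, bound = parse_threshold(crit)
        if func(hits, bound):
            return CRITICAL
    if warn is not None:
        func, bound = parse_threshold(warn)
        if func(hits, bound):
            return WARNING
    return OK


print(status_for_hits(7, warn=">5", crit=">10"))
```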


CheckDriveSize … CheckAll=volumes …
Other new features
◦ Added a new option to ignore drives which are not readable (like the Office 2010 Q: drive)
 ignore-unreadable
◦ Added magic modifiers (from check_mk)
 magic=0.7
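
The magic modifier is borrowed from check_mk, where a "magic factor" scales a percentage threshold by the size of the filesystem so that large volumes may fill up further before alerting. The Python sketch below follows the check_mk formula as it is generally described; the 20 GB reference size, the function name and the assumption that NSClient++ applies the same arithmetic are all illustrative, not taken from the slides.

```python
def magic_adjusted_level(level_percent, size_gb, magic, normsize_gb=20.0):
    """Scale a used-space threshold (in percent) by volume size.

    A volume of exactly `normsize_gb` keeps its threshold. Larger volumes get
    a higher threshold, smaller ones a lower threshold. magic=1.0 disables
    the adjustment.
    """
    relative_size = size_gb / normsize_gb
    felt_size = relative_size ** magic
    scale = felt_size / relative_size
    return 100.0 - (100.0 - level_percent) * scale


for size_gb in (5, 20, 2000):
    print(size_gb, round(magic_adjusted_level(80.0, size_gb, magic=0.7), 1))
```

With magic=0.7 an 80% warning level stays at 80% on the reference-sized volume, drops on the small volume and rises on the large one.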
Scheduled Tasks

Works the ”same” as CheckEventLog
◦ ”filter=exit_code ne 0”

Two modules:
◦ CheckTaskSched.dll
 Works on Windows NT4 and beyond
 But cannot check ”new” tasks (from Vista and beyond)
◦ CheckTaskSched2.dll
 Works on Windows Vista and beyond
 Has fewer filter keywords
Type                  Description
title                 Task name
application           The application
comment               Retrieves the comment for the work item.
parameters            Retrieves the command-line parameters of a task.
working_directory     Retrieves the working directory of the task.
exit_code             Retrieves the last exit code returned by the executable
                      associated with the work item on its last run.
max_run_time          Retrieves the maximum length of time the task can run.
status                Retrieves the status of the work item. Possible values include:
                      ready, running, not_scheduled, has_not_run, disabled,
                      has_more_runs, no_valid_triggers
most_recent_run_time  Retrieves the most recent time the work item began running.
CheckTaskSched "filter=exit_code ne 0" "syntax=%title%: %exit_code%" warn=>0
WARNING:test.job (1)

CheckTaskSched "filter=status = 'running' AND most_recent_run_time < -30m" "syntax=%title% (%most_recent_run_time%)" warn=>0
WARNING:test.job (2011-02-10 23:14:35)
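
The syntax option in the examples above is a template in which each %keyword% is replaced by the matching field of a task that passed the filter. A small Python sketch of that substitution, assuming unknown keywords are simply left untouched (the `render` helper and the sample task are hypothetical):

```python
import re


def render(syntax, item):
    """Replace each %key% in `syntax` with item[key]; leave unknown keys as-is."""
    def substitute(match):
        key = match.group(1)
        return str(item[key]) if key in item else match.group(0)
    return re.sub(r"%(\w+)%", substitute, syntax)


task = {
    "title": "test.job",
    "exit_code": 1,
    "status": "running",
    "most_recent_run_time": "2011-02-10 23:14:35",
}
line = render("%title% (%most_recent_run_time%)", task)
print("WARNING:" + line)
```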
Aliases

System
◦ alias_cpu
 CPU Load past 5 minutes, 80/90% bounds
◦ alias_cpu_ex
 CPU Load past 5 minutes, custom bounds
◦ alias_mem
 Memory utilization (all) 80/90% bounds.
◦ alias_mem_ex
 Memory utilization (all), custom bounds
◦ alias_up
 System uptime

Disk/Drive
◦ alias_disk
 All fixed drives
◦ alias_disk_loose
 All fixed drives, ignore any problematic drives
◦ alias_volumes
 All volumes
◦ alias_volumes_loose
 All volumes, ignore any problematic drives
◦ alias_file_size
 Check the size of a given file (filename, size)
◦ alias_file_age
 Check the age of a given file

Eventlog
◦ alias_event_log
 Check for errors in the event log

Scheduled Tasks
◦ alias_sched_all
 No scheduled jobs have failed
◦ alias_sched_long
 No task has been running for longer than a given time.
◦ alias_sched_task
 Check if a given task succeeded

Misc
◦ alias_updates
 Check that updates are applied

Processes
◦ alias_service
 All services in “sensible state”
◦ alias_service_ex
 All services in “sensible state” (exclude various services)
◦ alias_process
 A process must be running
◦ alias_process_stopped
 A process must not be running
◦ alias_process_count
 A process must not have more than X instances
◦ alias_process_hung
 A process must not be hung
Crash Handling

Using Google Breakpad
◦ same as Google Chrome, Mozilla Firefox, etc

Three options (not mutually exclusive)
1. Send crash dumps to crash.nsclient.org
 Server can be changed if you want to have an internal server or proxy server.
2. Store crash dumps for analysis
 Will also be checked with check_nscp
3. Restart service
[crash]
restart=1
service_name=nsclientpp
submit=0
url=http://crash.nsclient.org/submit
archive=1
#folder=<appfolder>/dumps
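
To make the three options concrete, the Python sketch below reads the [crash] section shown above with the standard configparser module and lists which actions are switched on. The order in which the actions are listed is an assumption made for the example; the slides only say the options are not mutually exclusive.

```python
import configparser

SAMPLE = """
[crash]
restart=1
service_name=nsclientpp
submit=0
url=http://crash.nsclient.org/submit
archive=1
#folder=<appfolder>/dumps
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)
crash = parser["crash"]

# Collect the enabled actions; "1" and "0" are parsed as booleans.
actions = []
if crash.getboolean("submit"):
    actions.append("submit to " + crash["url"])
if crash.getboolean("archive"):
    actions.append("archive locally")
if crash.getboolean("restart"):
    actions.append("restart service " + crash["service_name"])
print(actions)
```

With the values from the slide, dumps are archived and the service is restarted, while nothing is submitted to the crash server. The commented-out folder line is ignored by the parser.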
Miscellaneous Fixes

NSCA
◦ Fixed problems with sending ”many” results back

NRPE
◦ Added support for large payloads

Checks
◦ Added ”check_nscp” to check health of NSClient++
◦ Added new check for running other checks ”with a timeout”
◦ Added new negate check (to negate the result of another check)

All filters (read CheckEventLog et al)
◦ Many fixes and additions (regular expressions)

Process checks
◦ Added support for checking if processes have ”hung”

Performance data
◦ Added it to many places where it was intermittently missing before
What’s to come?
Roadmap (figure): 0.3.9, 0.4.0, 0.4.1, 0.4.2, 0.4.3
• 0.3.9: Last 0.3.x
• 0.4.0: Core switch, Linux support, Distributed Monitoring (v1)
• Later releases (0.4.1 to 0.4.3): Monitoring Kits, bugfixes, new Windows check subsystem, true passive checks, Distributed Monitoring (v2)
Overview

Brand new core based upon libraries
◦ Things should *work* not just “work”
◦ More modular and extensible

Unix support
◦ Both as a client and server

New settings subsystem
◦ Registry, improved ini support, http, etc

New protocol
◦ NSCP (HTTP(s), MQ, Native)

Distributed monitoring
◦ Many new things in this area (including MQ)

Python scripting
◦ Primary goal (for me) is to create “unit-test”

Updated installer
◦ Wix 3.5, more customizable

“Monitoring Kits”
◦ Monitoring solutions for “standard things”

New Windows check subsystem
◦ More modern and less arcane (no NT4 support)
◦ Remote checking

.Net plugin support
◦ Possibly internal VBA scripting support

Metrics cache and aggregation
◦ Lightweight version of CEP
◦ “crit=cpu > 80% AND transactions_per_sec < 10”

Filter-like API (in addition to options)
◦ “warn=any drive > 90% OR c: > 80%”

Remote updates/upgrades
◦ Allow NSCP to upgrade itself

“port” of the “standard plugins”?
◦ Run your favorite check_xxx from inside NSClient++

Unix plugins?
◦ Run CheckCPU on unix machines?

Client/web Interface?
◦ A nice little program (systray)

Let me know what you would like to see!
Brand new core

This is why it was so long in the making
◦ Merging each new version took forever!

New internal protocol
◦ Removed all internal “limits” (think buffer sizes)
◦ Allows many new features
◦ Allows much more advanced internal scripts
◦ Allows for “non NRPE based checks”

A lot of new bugs?
◦ This is the scary part (for me)
 but my testing has shown it seems very stable
Unix support

Good question…
◦ Since no one seems to like to program on Windows
 I brought NSClient++ to “unix”
◦ Because I can
 With the new core comes portability
 So, perhaps the better question was:
 Why not?

Will NOT be supported for some time though
◦ Unless someone wants to help out
New Settings

Hierarchical settings subsystem
◦ [/settings/NRPE/server]
◦ allow arguments=false

Instead of
◦ [NRPE Server]
◦ allow_arguments=false

Why did I do this?
◦ Because it was fun
◦ Number of options has started to explode
◦ Simpler to use the registry (as well as xml?)

Since settings have “url:s”
◦ old://${exe-path}/nsc.ini
◦ ini://${base-path}/nsclient.ini
◦ registry://HKEY_LOCAL_MACHINE/software/NSClient++
◦ http://my.central.server/config/${hostname}.ini

Allows extensions (not via plugins though)
◦ Maybe in the future:
 lua://${base-path}/config.lua
 python://${base-path}/config.py

You can mix and match:
◦ ini://${base-path}/nsclient.ini
 Can “include”: registry://HKEY_LOCAL_MACHINE/software/NSClient++
 Which in turn includes: http://conf.server/${hostname}.conf
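
Because every settings source is addressed by a URL, a loader can pick a backend from the scheme and fill in placeholders such as ${hostname} before use. The Python sketch below illustrates that idea only; the variable values and helper names are invented for the example, and how NSClient++ actually resolves these URLs is not shown in the slides.

```python
import re


def expand(url, variables):
    """Replace ${name} placeholders; unknown names are left as they are."""
    return re.sub(r"\$\{([^}]+)\}",
                  lambda m: variables.get(m.group(1), m.group(0)), url)


def split_settings_url(url):
    """Split 'scheme://location' into (scheme, location)."""
    scheme, _, location = url.partition("://")
    return scheme, location


# Hypothetical values for the placeholders.
variables = {"base-path": "C:/Program Files/NSClient++", "hostname": "web01"}

urls = [
    "ini://${base-path}/nsclient.ini",
    "registry://HKEY_LOCAL_MACHINE/software/NSClient++",
    "http://conf.server/${hostname}.conf",
]
for url in urls:
    scheme, location = split_settings_url(expand(url, variables))
    print(scheme, "->", location)
```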

Ability to load the same plugin twice.

Normal (default alias is python)
[/modules]
PythonScript=
[/settings/python/scripts]
test.py

Multiple modules (define two aliases foo and bar)
[/modules]
foo=PythonScript
bar=PythonScript
[/settings/foo/scripts]
test1.py
[/settings/bar/scripts]
test2.py
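
The alias mechanism can be read as "each entry under [/modules] creates one plugin instance, and each instance looks for its scripts under /settings/<alias>/scripts". The Python sketch below parses the two-alias configuration with the standard configparser module to show that relationship. It is an interpretation of the slide, not NSClient++ code, and it only handles the explicit alias=Module form.

```python
import configparser

SAMPLE = """
[/modules]
foo=PythonScript
bar=PythonScript
[/settings/foo/scripts]
test1.py
[/settings/bar/scripts]
test2.py
"""

# allow_no_value lets bare lines such as "test1.py" act as keys without a value.
parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
parser.optionxform = str  # keep the original case of module names
parser.read_string(SAMPLE)

instances = {}
for alias, module in parser["/modules"].items():
    section = "/settings/%s/scripts" % alias
    scripts = list(parser[section]) if parser.has_section(section) else []
    instances[alias] = (module, scripts)
print(instances)
```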

It depends…
◦ If you are “still” using check_nt:
 Probably not
◦ If you are using NSCA:
 Maybe not
◦ If you want to use all new features
 Yes

How do I change?
◦ It is pretty simple…
 nscp --settings --migrate-to ini
◦ (or)
 nscp --settings --migrate-to registry
New protocol
[Diagram 1: on the Nagios Server every check (CPU, Disk, Mem, ...) forks its own check_nrpe, and each one crosses the firewall separately to NSClient++ on the Windows computer.]
[Diagram 2: a single check_nscp on the Nagios Server reaches NSClient++ on the Windows computer through the firewall and covers CPU, Disk, Mem, ...]

Allows more than one command to be sent
Used internally for plugins
Supports both passive and active checks
Supports configuration, management, etc…
Extensible

But will also support:
◦ Multiple locales (based on utf)
◦ Unlimited payloads (soft configurable)
◦ Real performance data (not strings)
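
To illustrate what "real performance data (not strings)" could mean, the Python sketch below turns a classic Nagios performance data string into typed records. The record layout and names are assumptions made for this example; the slides do not show how the new protocol actually encodes the values, and threshold ranges are not handled.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class PerfValue:
    label: str
    value: float
    unit: str
    warn: Optional[float] = None
    crit: Optional[float] = None


# Nagios format per item: label=value[unit];[warn];[crit];[min];[max]
PERF_ITEM = re.compile(
    r"(?P<label>'[^']+'|[^\s=']+)=(?P<value>-?\d+(?:\.\d+)?)(?P<unit>[^;\s]*)"
    r"(?:;(?P<warn>[^;\s]*))?(?:;(?P<crit>[^;\s]*))?"
)


def _number(text):
    """Return a float, or None for missing or non-numeric thresholds."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def parse_perfdata(text):
    """Parse a perfdata string into a list of PerfValue records."""
    return [
        PerfValue(
            label=m.group("label").strip("'"),
            value=float(m.group("value")),
            unit=m.group("unit"),
            warn=_number(m.group("warn")),
            crit=_number(m.group("crit")),
        )
        for m in PERF_ITEM.finditer(text)
    ]


items = parse_perfdata("'cpu load'=42%;80;90 mem=1024MB;;")
print(items)
```

With typed records a receiver can compare or aggregate values directly instead of re-parsing strings at every hop.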
Distributed monitoring
[Diagrams: a Scheduler, Command broker and Event broker inside NSClient++, connecting checks (CheckCPU, CheckEventLog, real time plugin), the NRPE Server (queried by check_nrpe) and outgoing agents (NSCA Agent, SYSLOG Agent, XXX Agent) that deliver to NSCA Server, SysLog Server and XXX Server.]

An extension of the passive checks
◦ ”Something” can send notification events
◦ ”Something” can receive notification events
◦ Agents can forward notification events
◦ Replaces NSCAListener module

Supports routing

Not a one-to-one mapping.
◦ Multiple consumers
◦ Multiple producers
Allows
◦ Passive plugins (other than the built-in NSCA)
◦ Script and rule based routing
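
The many-to-many event model described here is essentially publish/subscribe. The toy Python broker below shows several producers feeding several consumers on one channel, plus an agent that forwards events to a second broker. Class, channel and event names are invented for illustration and do not reflect the NSClient++ API.

```python
from collections import defaultdict


class EventBroker:
    """Toy publish/subscribe broker: many producers, many consumers per channel."""

    def __init__(self):
        self._consumers = defaultdict(list)

    def subscribe(self, channel, consumer):
        """Register a callable that receives every event on `channel`."""
        self._consumers[channel].append(consumer)

    def publish(self, channel, event):
        """Deliver `event` to every consumer; return the number of deliveries."""
        consumers = self._consumers[channel]
        for consumer in consumers:
            consumer(event)
        return len(consumers)


local = EventBroker()
central = EventBroker()

nsca_out, syslog_out, central_out = [], [], []
local.subscribe("alerts", nsca_out.append)      # consumer 1
local.subscribe("alerts", syslog_out.append)    # consumer 2
# Forwarding: the local agent relays events to a central broker.
local.subscribe("alerts", lambda event: central.publish("alerts", event))
central.subscribe("alerts", central_out.append)

# Two different producers publish on the same channel.
local.publish("alerts", {"source": "CheckCPU", "status": "WARNING"})
local.publish("alerts", {"source": "CheckEventLog", "status": "CRITICAL"})
print(len(nsca_out), len(syslog_out), len(central_out))
```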
Python scripting


Built-in python scripting
Has full API support
◦ Can build ”modules” in python
◦ Can access settings
◦ Can do “anything”


Primarily used by me for unit-testing
Requires a working python install
Le Roi est mort, vive le Roi!


0.4.x (ish) will be the last ”Windows” monitoring agent
The idea is to make it more:
◦ A platform/client/server for distributed monitoring
 Regardless of os/system
 Regardless of Monitoring solutions

Don’t worry…
◦ It will still work just fine as a ”Windows Monitoring Agent”
◦ But in addition to this you will be able to do more.
Questions?
http://www.linkedin.com/in/mickem
http://nsclient.org
Facebook: facebook.com/nsclient
http://nsclient.org/nscp/conferances/2011/nwcna/