Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Joe VanAndel NCAR/EOL 2012/3/29

Why is Monitoring Important?

• Software systems can be very complex:
  • networked data sources
  • multiple computers
  • long-running daemons
• Hardware (including computers) can fail

Why is Monitoring Important (2)?

• Someone is relying on your system to produce or process data.

• Computers are better than people at monitoring: manual procedures are error-prone and don’t cover 24x7.

• Your staff may need to be notified out-of-hours if failures occur.

Why is Monitoring Important to S-Pol?

• S-Pol is a complex system of hardware and software; problems need to be detected so they can be corrected quickly.

• Notifications allow unattended operation, so staff don’t have to stay on site 24x7.

• We cannot afford to staff 3 shifts on field projects.

What is OMD?

• Open Monitoring Distribution (http://omdistro.org)
• Runs on Linux
• Bundles Nagios with 16 useful utilities, including:
  • check_mk - creates Nagios configurations for you!
  • rrdtool/rrdcached - store and retrieve time series data, support graphing of performance data.

Why use OMD?

• Complete package of monitoring tools
• Avoid the effort of compiling and integrating Nagios add-ons
• Web-based monitoring - from anywhere!

Why use check_mk?

• Automatically generates Nagios rules for each machine you monitor.

• Lower overhead allows running more checks on more hosts.

• Easy to create both hardware and software checks.

• The S-Pol radar had 700 checks running on 14 hosts; we didn’t want to generate the Nagios configuration manually.

check_mk architecture

figure from http://mathias-kettner.de

RRD is “Round Robin Database” which efficiently stores the output from check_mk.
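The round-robin idea can be illustrated with a fixed-size buffer: once every slot is used, each new sample overwrites the oldest, so storage never grows. This is only a sketch of the concept, not rrdtool's file format or API (rrdtool additionally consolidates samples into coarser archives); the class and method names are hypothetical.

```python
from collections import deque


class RoundRobinSeries:
    """Fixed-size time series: once full, each new sample evicts the oldest.

    Hypothetical illustration of the round-robin concept only; this is not
    how rrdtool is actually invoked or how it lays out its files.
    """

    def __init__(self, slots):
        # deque(maxlen=...) silently drops the oldest item when full.
        self._samples = deque(maxlen=slots)

    def update(self, timestamp, value):
        self._samples.append((timestamp, value))

    def fetch(self):
        return list(self._samples)


series = RoundRobinSeries(slots=3)
for minute, load in enumerate([0.5, 0.7, 1.2, 0.9]):
    series.update(minute * 60, load)

# Four samples were written, but only the newest three are retained.
print(series.fetch())
```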

check_mk_agent

Getting Started with OMD

• install the RPM
• $ omd create mysite # create the monitoring instance
• create scripts in /usr/lib/check_mk_agent/local
• $ check_mk -I # run inventory
• $ omd start mysite # start daemons
• open the check_mk URL in a browser

Writing a check is simple

• write a C program, shell script, or Python script
• query hardware or software status
• output string(s) to stdout: "0 PgenTritonRaidStatus - OK"
• run a check_mk inventory to:
  • find your script
  • generate the Nagios configuration
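A minimal Python sketch of such a check, assuming the usual local-check line layout of status code, item name, performance data ("-" when there is none), and free text. The helper name local_check_line is hypothetical; only the sample output string comes from the slide.

```python
# Nagios-style status codes used in the first field of a local check line.
STATUS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def local_check_line(status, item_name, text, perfdata="-"):
    """Build one line: <code> <item_name> <perfdata> <text>."""
    if " " in item_name:
        # Fields are space-separated, so the item name must be one word.
        raise ValueError("item name must not contain spaces")
    return f"{STATUS_CODES[status]} {item_name} {perfdata} {text}"


# A real check would query the hardware or software first; here the
# result is hard-coded to reproduce the line shown on the slide.
print(local_check_line("OK", "PgenTritonRaidStatus", "OK"))
```

Saved as an executable file in /usr/lib/check_mk_agent/local, a script like this would be picked up by the next inventory run.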

Example local check: /usr/lib/check_mk_agent/local/filecount, a #!/bin/bash script whose loop and if/else branches set statustxt=WARNING or statustxt=CRITICAL.
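The filecount script is a bash script and only fragments of it are shown here. As a rough Python analogue, not the original script, the sketch below counts the files in a directory and maps the count to OK/WARNING/CRITICAL. The thresholds, the directory, and the assumption that too many files is the problem are all hypothetical.

```python
import os


def count_files(directory):
    """Count regular files directly inside `directory` (no recursion)."""
    with os.scandir(directory) as entries:
        return sum(1 for entry in entries if entry.is_file())


def filecount_status(count, warn, crit):
    """Map a file count to a local-check line (thresholds are assumptions)."""
    if count >= crit:
        code, statustxt = 2, "CRITICAL"
    elif count >= warn:
        code, statustxt = 1, "WARNING"
    else:
        code, statustxt = 0, "OK"
    # Performance data "count=<value>;<warn>;<crit>" lets the count be graphed.
    return f"{code} filecount count={count};{warn};{crit} {statustxt} - {count} files"


# Hypothetical usage: check the current directory against made-up limits.
print(filecount_status(count_files("."), warn=1000, crit=5000))
```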

S-Pol monitoring

• Radar hardware for S-Band & Ka-band:
  • antenna
  • transmitter
  • receiver
• Klystron temperature
• Container temperatures

Hardware Monitoring Architecture

Sixnet Controller

Hardware monitoring

• Sixnet controller communicates with measurement modules using RS-485
• monitors transmitter status
• monitors antenna status
• monitors transmitter temperature
• Sixnet controller runs Linux, so adding a check_mk_agent was easy!

What else?

• Computer status:
  • CPU load
  • disk space
  • memory usage
• radar software - tasks running, products being produced
• fetching data: satellite images, soundings, forecast model output

Implementation

• installed OMD on a rack-mount Linux server
• installed check_mk_agent on all monitored computers
• wrote scripts, installed in /usr/lib/check_mk_agent/local

Implementation (2)

• Configured digital IO modules (controlled by an embedded Sixnet computer) to monitor S-Pol hardware
• Wrote a program on the Sixnet that reported hardware status to check_mk_agent
• Sent Ka-band status over the network, wrote software to create status files readable by check_mk scripts
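One way a check script might consume such a status file is sketched below: report the file's first line, and go CRITICAL when the file is missing or has not been updated recently. The file name, item name, one-line file layout, and 120-second limit are assumptions for illustration, not the S-Pol implementation.

```python
import os
import tempfile
import time


def status_file_check(path, max_age_s, item_name, now=None):
    """Return a local-check line based on a status file's age and content."""
    now = time.time() if now is None else now
    try:
        age = max(0.0, now - os.path.getmtime(path))
        with open(path) as handle:
            first_line = handle.readline().strip()
    except OSError as err:
        return f"2 {item_name} - CRITICAL - cannot read {path}: {err.strerror}"
    if age > max_age_s:
        # A stale file suggests the sender or the network link has stopped.
        return f"2 {item_name} age={age:.0f} CRITICAL - status is {age:.0f}s old"
    return f"0 {item_name} age={age:.0f} OK - {first_line}"


# Demo with a throwaway file standing in for a real status file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "kaband_status.txt")
    with open(demo_path, "w") as handle:
        handle.write("transmitter ready\n")
    fresh = status_file_check(demo_path, 120, "KaBandStatus")
    stale = status_file_check(demo_path, 120, "KaBandStatus", now=time.time() + 600)
    missing = status_file_check(os.path.join(tmp, "absent.txt"), 120, "KaBandStatus")

print(fresh)
print(stale)
print(missing)
```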

Types of S-Pol checks

• scripts/programs directly monitor hardware or software
• hybrid scripts - process the output of an existing program, output check_mk status reports.
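A hybrid script could look roughly like the sketch below: take the text printed by an existing program, pull out one value, and re-emit it as a check_mk status line. The "key=value" output format, the KlystronTemp item name, and the thresholds are invented for illustration.

```python
def parse_key_values(output):
    """Parse 'key=value' lines (an assumed format) into a dict."""
    values = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


def temperature_report(output, warn_c=45.0, crit_c=55.0):
    """Turn an existing program's output into one local-check line."""
    values = parse_key_values(output)
    try:
        temp = float(values["temp_c"])
    except (KeyError, ValueError):
        return "3 KlystronTemp - UNKNOWN - no temperature in program output"
    if temp >= crit_c:
        code, label = 2, "CRITICAL"
    elif temp >= warn_c:
        code, label = 1, "WARNING"
    else:
        code, label = 0, "OK"
    return f"{code} KlystronTemp temp_c={temp};{warn_c};{crit_c} {label} - {temp} C"


# In practice the text would come from running the existing program,
# e.g. subprocess.run([...], capture_output=True, text=True).stdout.
sample_output = "state=on\ntemp_c=41.5\n"
print(temperature_report(sample_output))
```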

Implementation (3)

• configured GSM cell phone to send SMS messages
  • software from gnokii.org
• bought local SIM
• wrote script to limit frequency of SMS messages
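The frequency limit could be implemented in many ways; the sketch below uses a sliding window that allows at most a fixed number of messages per time window and suppresses the rest. It is not the script used at S-Pol, and the class name and limits are hypothetical. A script started fresh for each notification would also need to save the timestamps to a file between runs.

```python
import time
from collections import deque


class SmsThrottle:
    """Allow at most `max_messages` sends in any `window_s`-second window."""

    def __init__(self, max_messages, window_s):
        self.max_messages = max_messages
        self.window_s = window_s
        self._sent = deque()  # timestamps of messages that were allowed

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Forget sends that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) < self.max_messages:
            self._sent.append(now)
            return True
        return False


# Two messages per 5 minutes: the third alert in a burst is suppressed,
# and sending resumes once the earlier messages leave the window.
throttle = SmsThrottle(max_messages=2, window_s=300)
decisions = [throttle.allow(now=t) for t in (0, 10, 20, 310)]
print(decisions)
```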

Sample Web Screens

Challenges

• learning how to create advanced checks with graphs
• avoiding false alarms (particularly after hours!)
• limiting frequency of notifications - getting 20 text messages on your cell phone in 5 minutes is not helpful!

How well did OMD/Nagios work?

• The second shift only had to be on-site from 3:00PM to 8:00PM, rather than until 11:00PM.
• Daytime: OMD/Nagios warned staff of problems on multiple occasions.
• Off-hours: OMD/Nagios notified S-Pol staff of critical hardware/software failures on multiple occasions.

24x7 Operations without working 24x7

• Added SMS (text message) notifications to Nagios
• Technicians and Engineers carried cell phones
• Nagios sent SMS when hardware or software problems occurred.
• Technicians and Engineers would access Nagios web pages via 3G modems on laptops

FUTURE

• Monitoring of diesel generators
• Add remote control:
  • generator & transfer switch
  • reset of transmitter faults
  • reset of antenna faults

Conclusion

• Monitoring is important for any system, critical for complex or unattended operation
• OMD/Nagios makes it easy to deploy monitoring
• OMD/Nagios helped EOL maintain high data quality from S-Pol without requiring staff 24x7 on site.
• Notifications via SMS and remote access to OMD’s web pages are very helpful.

Acknowledgments

• Ethan Galstad - Nagios chief developer
• Mathias Kettner - check_mk
• Fatima Dembele (summer intern) - prototyping
• Paloma Gutierrez - hardware monitoring
• Chris Burghart - Ka-band monitoring
• Mike Dixon - Ka-band & HAWK monitoring

Questions?