Using OMD/Nagios to Monitor Complex Hardware/Software Systems
Joe VanAndel NCAR/EOL 2012/3/29
Why is Monitoring Important?
• Software systems can be very complex:
  • networked data sources
  • multiple computers
  • long-running daemons
• Hardware (including computers) can fail
Why is Monitoring Important (2)?
• Someone is relying on your system to produce or process data.
• Computers are better than people at monitoring: manual procedures are error-prone and don’t cover 24x7.
• Your staff may need to be notified out-of-hours if failures occur.
Why is Monitoring Important to S-Pol?
• S-Pol is a complex system of hardware and software - need to detect problems so they can be quickly corrected.
• Notifications allow unattended operation, so staff don’t have to stay on site 24x7.
• We cannot afford to staff 3 shifts on field projects.
What is OMD?
• Open Monitoring Distribution (http://omdistro.org)
• runs on Linux
• bundles Nagios with 16 useful utilities, including:
  • check_mk - creates Nagios configurations for you!
  • rrdtool/rrdcached - store and retrieve time-series data; support graphing of performance data.
Why use OMD?
• complete package of monitoring tools
• avoid the effort of compiling and integrating Nagios add-ons
• web-based monitoring - from anywhere!
Why use check_mk?
• Automatically generates Nagios rules for each machine you monitor.
• Lower overhead allows running more checks on more hosts.
• Easy to create both hardware and software checks.
• The S-Pol radar had 700 checks running on 14 hosts; we didn’t want to write the Nagios configuration by hand.
figure from http://mathias-kettner.de
RRD stands for “Round Robin Database”, which efficiently stores the output from check_mk.
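The round-robin idea can be illustrated with a fixed-size buffer in which the newest sample overwrites the oldest, so storage never grows. This is only a sketch of the concept, not rrdtool's file format or API; the class RoundRobinSeries and its methods are hypothetical names.

```python
from collections import deque


class RoundRobinSeries:
    """Fixed-size time series: once full, each new sample evicts the oldest."""

    def __init__(self, slots):
        # deque(maxlen=...) drops items from the left when it overflows.
        self._samples = deque(maxlen=slots)

    def update(self, timestamp, value):
        """Store one (timestamp, value) sample."""
        self._samples.append((timestamp, value))

    def fetch(self):
        """Return the retained samples, oldest first."""
        return list(self._samples)


if __name__ == "__main__":
    series = RoundRobinSeries(slots=3)
    for t in range(5):
        series.update(t, t * 10)
    print(series.fetch())
```

With three slots and five updates, only the three most recent samples remain.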
Getting Started with OMD
• install the RPM
• $ omd create mysite   # create the monitoring instance
• create scripts in /usr/lib/check_mk_agent/local
• $ check_mk -I   # run inventory
• $ omd start mysite   # start daemons
• open the check_mk URL in a browser.
Writing a check is simple
• write a C program, shell script, or Python script
• query hardware or software status
• output string(s) to stdout: "0 PgenTritonRaidStatus - OK"
• run a check_mk inventory to
  • find your script
  • generate the Nagios configuration
Example local check: /usr/lib/check_mk_agent/local/filecount (a bash script, #!/bin/bash, that sets statustxt to WARNING or CRITICAL)
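A local check of this kind might look like the following Python sketch. It is not the original filecount script: the watched directory, the thresholds, the service naming, and the assumption that too many files signals a problem are all hypothetical. Each output line follows the local-check layout shown above: status code, service name, performance data (or "-"), then free text.

```python
#!/usr/bin/env python3
"""Sketch of a file-count local check; paths and thresholds are assumptions."""
import os

STATUS_TEXT = {0: "OK", 1: "WARNING", 2: "CRITICAL"}

# Hypothetical directories to watch, each with (warn, crit) file-count limits.
WATCHED = {"/tmp": (1000, 5000)}


def classify(count, warn, crit):
    """Map a file count to a Nagios status code (0, 1, or 2)."""
    if count >= crit:
        return 2
    if count >= warn:
        return 1
    return 0


def filecount_line(directory, warn, crit):
    """Return one local-check line: <status> <service> <perfdata> <text>."""
    service = "Filecount_" + directory.strip("/").replace("/", "_")
    try:
        count = sum(1 for entry in os.scandir(directory) if entry.is_file())
    except OSError as err:
        # 3 = UNKNOWN: the directory could not be read at all.
        return f"3 {service} - UNKNOWN - {err}"
    status = classify(count, warn, crit)
    perfdata = f"count={count};{warn};{crit}"
    text = f"{STATUS_TEXT[status]} - {count} files in {directory}"
    return f"{status} {service} {perfdata} {text}"


if __name__ == "__main__":
    for path, (warn, crit) in WATCHED.items():
        print(filecount_line(path, warn, crit))
```

Including the count as performance data lets the monitoring server graph it over time.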
S-Pol monitoring
• Radar hardware for S-band & Ka-band:
  • antenna
  • transmitter
  • receiver
  • klystron temperature
  • container temperatures
Hardware Monitoring Architecture
Hardware monitoring
• Sixnet controller communicates with measurement modules over RS-485
• monitors transmitter status
• monitors antenna status
• monitors transmitter temperature
• Sixnet controller runs Linux, so adding a check_mk_agent was easy!
• Computer status:
  • CPU load
  • disk space
  • memory usage
• radar software - tasks running, products being produced
• fetching data: satellite images, soundings, forecast model output
Implementation
• installed OMD on a rack-mount Linux server
• installed check_mk_agent on all monitored computers
• wrote scripts, installed in /usr/lib/check_mk_agent/local
Implementation(2)
• Configured digital IO modules (controlled by an embedded Sixnet computer) to monitor S-Pol hardware
• Wrote a program on the Sixnet that reported hardware status to check_mk_agent
• Sent Ka-band status over the network; wrote software to create status files readable by check_mk scripts
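A check that reads such a status file could be sketched as below. The file layout (a first line of the form "<status code> <text>"), the service name, and the 120-second staleness limit are assumptions for illustration, not the format S-Pol used. Treating a stale file as CRITICAL is one possible policy: if the sender stops updating the file, the last status can no longer be trusted.

```python
import os
import time


def status_from_file(path, service, max_age_s=120, now=None):
    """Turn a status file into one local-check line, flagging stale files."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
        with open(path) as handle:
            first = handle.readline().strip()
    except OSError:
        return f"3 {service} - UNKNOWN - cannot read {path}"
    if age > max_age_s:
        return f"2 {service} age={age:.0f} CRITICAL - status file is {age:.0f}s old"
    code, _, text = first.partition(" ")
    if code not in ("0", "1", "2", "3"):
        return f"3 {service} - UNKNOWN - malformed status file"
    return f"{code} {service} age={age:.0f} {text or 'no detail'}"


if __name__ == "__main__":
    # Hypothetical path; prints an UNKNOWN line if the file does not exist.
    print(status_from_file("/var/run/ka_band_status.txt", "KaBandStatus"))
```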
Types of S-Pol checks
• scripts/programs that directly monitor hardware or software
• hybrid scripts - process the output of an existing program and emit check_mk status reports.
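A hybrid script might be sketched as follows: run an existing program, pick one value out of its output, and report it as a check_mk status line. The program name read_container_temp, its "temperature_c=<value>" output, the service name, and the thresholds are all hypothetical.

```python
import subprocess


def parse_temperature(output):
    """Extract the value from a 'temperature_c=<float>' line, or return None."""
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "temperature_c":
            try:
                return float(value)
            except ValueError:
                return None
    return None


def temperature_line(output, warn=45.0, crit=55.0, service="ContainerTemp"):
    """Convert the program output into one local-check line."""
    temp = parse_temperature(output)
    if temp is None:
        return f"3 {service} - UNKNOWN - no temperature in program output"
    status = 2 if temp >= crit else 1 if temp >= warn else 0
    label = ("OK", "WARNING", "CRITICAL")[status]
    return f"{status} {service} temp={temp};{warn};{crit} {label} - {temp} C"


def run_tool(cmd):
    """Run the existing program; return its stdout, or '' if it cannot run."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout


if __name__ == "__main__":
    # Hypothetical command; a missing program yields an UNKNOWN line.
    print(temperature_line(run_tool(["read_container_temp"])))
```

Keeping the parsing separate from the subprocess call makes the parser easy to test against captured output.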
Implementation(3)
• configured GSM cell phone to send SMS messages
  • software from gnokii.org
• bought local SIM
• wrote script to limit frequency of SMS messages
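One way to limit SMS frequency is a sliding-window limiter, sketched below. This is not the script S-Pol used; the class name, the default of 3 messages per 15 minutes, and the in-memory state are assumptions. A script launched once per notification would need to persist the timestamps, for example in a file.

```python
import time
from collections import deque


class SmsRateLimiter:
    """Allow at most max_messages sends within any window_s-second window."""

    def __init__(self, max_messages=3, window_s=900, clock=time.time):
        self._max = max_messages
        self._window_s = window_s
        self._clock = clock
        self._sent = deque()  # timestamps of messages that were allowed

    def allow(self):
        """Return True and record the send if the limit has not been reached."""
        now = self._clock()
        # Forget sends that have aged out of the window.
        while self._sent and now - self._sent[0] >= self._window_s:
            self._sent.popleft()
        if len(self._sent) < self._max:
            self._sent.append(now)
            return True
        return False


if __name__ == "__main__":
    limiter = SmsRateLimiter(max_messages=2, window_s=300)
    for attempt in range(4):
        print(attempt, "send" if limiter.allow() else "suppress")
```

The caller sends the SMS only when allow() returns True; suppressed alerts remain visible on the Nagios web pages.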
Sample Web Screens
Challenges
• learning how to create advanced checks with graphs
• avoiding false alarms (particularly after hours!)
• limiting frequency of notifications - getting 20 text messages on your cell phone in 5 minutes is not helpful!
How well did OMD/Nagios work?
• The second shift only had to be on site from 3:00 PM to 8:00 PM, rather than until 11:00 PM.
• Daytime: OMD/Nagios warned staff of problems on multiple occasions.
• Off-hours: OMD/Nagios notified S-Pol staff of critical hardware/software failures on multiple occasions.
24x7 Operations without working 24x7
• Added SMS (text message) notifications to Nagios
• Technicians and engineers carried cell phones
• Nagios sent an SMS when hardware or software problems occurred.
• Technicians and engineers accessed the Nagios web pages via 3G modems on laptops
Future
• Monitoring of diesel generators
• Add remote control:
  • generator & transfer switch
  • reset of transmitter faults
  • reset of antenna faults
Conclusion
• Monitoring is important for any system, and critical for complex or unattended operation
• OMD/Nagios makes it easy to deploy monitoring
• OMD/Nagios helped EOL maintain high data quality from S-Pol without requiring staff on site 24x7.
• Notifications via SMS and remote access to OMD’s web pages are very helpful.
Acknowledgments
• Ethan Galstad - Nagios chief developer
• Mathias Kettner - check_mk
• Fatima Dembele (summer intern) - prototyping
• Paloma Gutierrez - hardware monitoring
• Chris Burghart - Ka-band monitoring
• Mike Dixon - Ka-band & HAWK monitoring