Introduction - Northern Kentucky University

CIT 470: Advanced Network and System Administration
System Monitoring
Slide #1
Topics
1. Why monitoring?
2. Historical monitoring
3. Real-time monitoring
4. Monitoring techniques
5. Monit
6. Web-based monitoring tools
Slide #2
Why Monitoring?
“If you aren’t monitoring a service,
you can’t manage it.”
Slide #3
Why Monitoring?
1. Rapidly detect and fix problems.
2. Identify the source of problems.
3. Predict and avoid future problems.
4. Document an SA’s achievements.
Slide #4
Historical Monitoring
Record long-term system statistics.
  Uptime.
  Performance.
  Security.
  Utilization.
Examples
  Web server uptime was 99.99% last year, compared to 99.9% the previous year.
  Peak network usage is 8 MBps, up from 5 MBps.
Uses
  Capacity planning.
  Planning for reliability or security improvements.
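To make the uptime example concrete, the sketch below converts an uptime percentage into the downtime it allows per year. It assumes a 365-day year, and the function name is illustrative rather than taken from any tool.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours of downtime per year permitted by a given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.99):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% uptime allows about {hours:.2f} hours of downtime per year")
```

Under that assumption, moving from 99.9% to 99.99% cuts the allowed downtime from roughly 8.76 hours to roughly 53 minutes per year.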
Slide #5
Historical Monitoring Processes
Polling
  Take measurements at regular intervals.
  Store measurements in a database.
  Graph summaries of collected data.
Measurement Tools
  iostat
  vmstat
  ps
  sar
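The polling process above can be sketched in a few lines of Python. This is a minimal illustration, not how sar or any particular tool works: the table layout, the function names, and the choice of SQLite as the store are all assumptions made for the example.

```python
import os
import sqlite3
import time


def init_db(conn: sqlite3.Connection) -> None:
    """Create the measurements table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples (ts REAL, metric TEXT, value REAL)"
    )


def poll(conn, read_metric, metric, count, interval_s):
    """Take `count` measurements, `interval_s` seconds apart, and store them."""
    for i in range(count):
        conn.execute(
            "INSERT INTO samples VALUES (?, ?, ?)",
            (time.time(), metric, read_metric()),
        )
        conn.commit()
        if i < count - 1:
            time.sleep(interval_s)


def summarize(conn, metric):
    """Return (min, avg, max) for a metric: the raw material for a graph."""
    return conn.execute(
        "SELECT MIN(value), AVG(value), MAX(value) FROM samples WHERE metric = ?",
        (metric,),
    ).fetchone()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # a real collector would use a file
    init_db(conn)
    # os.getloadavg() is Unix-only; index 0 is the 1-minute load average.
    poll(conn, lambda: os.getloadavg()[0], "load1", count=3, interval_s=1)
    print(summarize(conn, "load1"))
```

Passing the measurement in as a function keeps the loop independent of what is being measured, so the same loop could record disk, memory, or network figures.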
Slide #6
Real-time Monitoring
Alert SA to failures as they happen.
Discover problems before the customer does.
Shorter outages.
Better reputation.
Real-time Monitor components
Monitoring system (poll or alert).
Notification system.
Slide #7
Real-time Monitoring Techniques
Polling
  Poll systems and applications for status.
  Ex: ping critical servers every 5 minutes.
Alerting
  Many systems can send alerts to the monitoring system when they detect a problem.
  Ex: RAID array logs a disk failure.
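As a rough sketch of the polling example, the code below pings each host once and returns the ones that did not answer. It assumes a Unix-style ping command that accepts -c (count); the function names are made up for this example.

```python
import subprocess


def ping(host: str, timeout_s: int = 5) -> bool:
    """Return True if one ICMP echo request gets a reply.

    Assumes a Unix-style `ping` that accepts -c (packet count).
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def poll_once(hosts, probe=ping):
    """Probe every host once and return the list of hosts that failed."""
    return [host for host in hosts if not probe(host)]
```

Running poll_once from cron every 5 minutes, or in a loop with a 300-second sleep, and handing the returned list to the notification system would give the behavior described on the slide. The probe argument lets a different check be substituted without changing the loop.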
Slide #8
Notification
Types of notification
1. Email
2. Paging
3. Phone call
Reliability
1. Notification system should not depend on the system being monitored.
2. Email can fail or have long delays.
3. Pages are susceptible to third-party failures and monitoring.
Slide #9
Escalation
What if the SA is on vacation?
Notifications need to be transferable.
  Static: reconfigure notifier before vacation.
  Dynamic: configurable set of recipients.
  Ex: If SA doesn’t respond in 1 hour, notify manager.
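One way to picture dynamic escalation is a list of (delay, recipient) pairs that is consulted each time an alert is still unacknowledged. The sketch below is only an illustration; the addresses and names are hypothetical.

```python
def recipients_for(minutes_unacknowledged, escalation_chain):
    """Return everyone who should have been notified by now.

    escalation_chain is a list of (delay_minutes, recipient) pairs.
    """
    return [
        who for delay, who in escalation_chain if minutes_unacknowledged >= delay
    ]


# Hypothetical chain: the SA is notified at once, the manager after 1 hour.
CHAIN = [(0, "sa@example.com"), (60, "manager@example.com")]
```

Because the chain is data rather than code, covering a vacation means editing one list instead of reconfiguring the notifier.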
Slide #10
Types of monitoring
Availability
  Watch for outages in network, host, apps.
  Ex: cannot reach mail server.
Capacity
  Check thresholds for CPU, mem, disk, network.
  Ex: mail spool disk is 95% full.
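A capacity check boils down to measuring a value and comparing it with a threshold. The sketch below does this for disk space using Python's standard library; the 90% limit and the path are example values, and the function names are illustrative.

```python
import shutil


def percent_used(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def over_threshold(value: float, threshold: float) -> bool:
    """True when a measured value has reached its alert threshold."""
    return value >= threshold


if __name__ == "__main__":
    used = percent_used("/")  # example path
    status = "ALERT" if over_threshold(used, 90) else "ok"
    print(f"/ is {used:.1f}% full: {status}")
```

The figure may differ slightly from what df reports, since df typically accounts for blocks reserved for root.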
Slide #11
Active Monitoring
Active monitoring systems can fix problems.
1. Respond faster than a human can.
2. Can typically only implement a temporary fix.
3. Can’t fix all problems: bad disk, out of paper.
Risks
  Reliability: Test active responses thoroughly before deployment.
  Security: Active monitor typically needs admin access on all monitored systems.
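The sketch below shows one possible shape for an active response: try an automatic fix a limited number of times, then hand the problem to a human. The check, fix, and notify callables are placeholders supplied by the caller, and the function name is hypothetical.

```python
def respond(check, fix, notify, max_attempts=2):
    """Run `fix` up to max_attempts times while `check` keeps failing.

    Returns True if the service is healthy at the end; otherwise calls
    `notify` so that a human can take over.
    """
    attempts = 0
    while not check():
        if attempts >= max_attempts:
            notify(f"automatic fix failed after {attempts} attempts")
            return False
        fix()
        attempts += 1
    return True
```

Capping the number of attempts matters because some failures, such as a bad disk, cannot be repaired by restarting anything, and an uncapped loop would restart the service forever.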
Slide #12
Levels of Testing
1. Check server is pingable.
   Verifies connectivity from monitor only.
2. Check that application is up.
   Make a TCP connection to service port.
   Check process or service list.
3. End-to-end testing.
   Entire transaction as customer would do.
   Ex: send and receive an e-mail message.
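The second level, making a TCP connection to the service port, can be sketched with Python's socket module. The function name and default timeout are choices made for this example.

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Level 2 check: can a TCP handshake with the service port complete?"""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False
```

A successful connection only shows that something is listening on the port. It does not show that the application answers correctly, which is why end-to-end testing is listed as a separate level.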
Slide #13
Running monit
Starting
  monit [-v]
Status
  monit status
  monit summary
  (also provides web interface on port 2812)
Stopping
  monit quit
Slide #14
Global configuration
set daemon 60
set logfile syslog facility log_daemon
set alert root@domain
set mailserver my-server
set httpd port 2812 address localhost
    allow localhost
    allow admin:monit
Slide #15
Monitoring a Process
check process apache
  with pidfile "/usr/local/apache/logs/httpd.pid"
  start = "/etc/init.d/httpd start"
  stop = "/etc/init.d/httpd stop"
  if failed port 80 and protocol http
    and request "/cgi-bin/printenv"
    then restart
  if cpu usage is greater than 60 percent for 2 cycles then alert
  if cpu usage > 98% for 5 cycles then restart
  if 2 restarts within 3 cycles then timeout
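The "for N cycles" qualifiers in the rules above keep a single spike from triggering an action. The sketch below illustrates that general idea, under the assumption that the condition must hold for N polls in a row. It is not Monit's implementation, and the class name is hypothetical.

```python
class ConsecutiveTrigger:
    """Fire only after a condition has held for `cycles` polls in a row."""

    def __init__(self, cycles: int):
        self.cycles = cycles
        self.streak = 0

    def update(self, condition_true: bool) -> bool:
        """Record one poll result; return True when the action should fire."""
        self.streak = self.streak + 1 if condition_true else 0
        return self.streak >= self.cycles
```

With cycles set to 2, a sequence of high, normal, high, high readings fires only on the last poll, because the normal reading resets the streak.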
Slide #16
Monitoring a File
# Rotate log if it gets too big
check file access_log
  with path /var/log/access_log
  if size > 100 Mb
    then exec "/usr/sbin/logrotate -f rotate_apache_now"
# Restart Apache if config changes
check file httpd.conf
  with path /usr/local/apache/conf/httpd.conf
  if changed checksum then exec
    "/usr/local/apache/bin/apachectl graceful"
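Detecting a changed file by checksum means hashing its contents on each poll and comparing the result with the previous value. The sketch below shows the idea with SHA-256, which is a choice made for this example and not necessarily the hash Monit uses; the function names are illustrative.

```python
import hashlib


def file_checksum(path: str) -> str:
    """SHA-256 digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: str, last_checksum: str) -> bool:
    """True if the file's current checksum differs from the stored one."""
    return file_checksum(path) != last_checksum
```

Unlike a timestamp comparison, a checksum ignores a file that was rewritten with identical contents, so the service is reloaded only when the configuration really differs.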
Slide #17
Monitoring CPU
check system localhost
  if loadavg (1min) > 5 then alert
  if loadavg (5min) > 3 then alert
  if memory usage > 80% then alert
  if cpu usage (user) > 80% then alert
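For comparison, the load average rules above can be approximated outside Monit with a short script. This is a sketch under assumptions: the limits mirror the slide, the function name is made up, and os.getloadavg() exists only on Unix-like systems.

```python
import os


def load_alerts(loadavg, limits):
    """Compare (1min, 5min, 15min) load averages against optional limits.

    A limit of None means "do not check this window".
    """
    labels = ("1min", "5min", "15min")
    return [
        f"loadavg ({label}) is {value:.2f}, limit {limit}"
        for label, value, limit in zip(labels, loadavg, limits)
        if limit is not None and value > limit
    ]


if __name__ == "__main__":
    # Limits taken from the rules above: 1min > 5, 5min > 3, no 15min check.
    for message in load_alerts(os.getloadavg(), (5, 3, None)):
        print("ALERT:", message)
```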
Slide #18
Monitoring a Disk
check device rootfs with path /
  if space usage > 90% then alert
check device varfs with path /var
  if space usage > 90% then alert
Slide #19
Monitoring Remote Hosts
# Ping the host to see if it’s up
check host foo with address foo.com
  if failed icmp type echo
    with timeout 15 seconds then alert
# Detailed test, accessing web services
check host foo with address foo
  if failed port 80 protocol http and
    request "/status" then alert
  if failed port 443 type TCPSSL and
    protocol http with timeout 15 seconds then alert
Slide #20
Monitoring Tools
• Ganglia
• Cacti
• Nagios
• Zabbix
• Hyperic HQ
• Munin
• ZenOSS
• OpenNMS
• GroundWork
• God
• Monit
Nagios
Slide #22
Nagios Network Maps
Slide #23
Nagios Graphs
Slide #24
Zabbix Graphs
Slide #25