Lemon
Computer Monitoring at CERN
Miroslav Siket
CERN-IT/FIO-FS
Outline
• Lemon – what is it?
• Structure
• Functionality
• Metrics
• Alarms
• Web visualization
13/4/2005
Sysadmin Introduction at CERN
Lemon – LHC Era Monitoring
• Lemon is a software package containing tools for monitoring the status and performance of computers (currently limited to Linux and Solaris)
• It contains the following components:
– Sensors (they measure individual metrics [values])
– MSA (Monitoring Sensor Agent)
– Monitoring Repository (a daemon that receives the metrics)
– Monitoring Repository Backend (storage)
– LRF (Lemon RRD tool framework – caching and web presentation tools)
– Correlation Engines
– Lemon Client (tool for retrieving data)
– LAG (Laser Alarm Gateway – tool for passing alarms to the Laser system)
• See http://cern.ch/lemon for more info
Lemon - schema
[Figure: Lemon architecture schema. Components shown: nodes, each running sensors and a Monitoring Agent; Monitoring Repository; Repository backend; Correlation Engines; RRDTool / PHP on Apache; Lemon CLI; web browser; user. Connection labels: TCP/UDP, SQL, SOAP, HTTP.]
Sensor (MS) and Sensor Agent (MSA)
• A sensor measures data in response to requests from the MSA
• The MSA receives the data from the sensor through a pipe
• The MSA sends the data to the Monitoring Repository (MR) over a UDP socket
• Typical communication between the two:
– MSA forks the sensor
– MSA: INI 1 LoadAvg
– MSA: GET 1
– Sensor: PUT 1 0.42
– MSA: sends UDP packet to MR
• The MSA controls the frequency and status of the individual sensors (there are several of them)
• You can write sensors yourself (bash, C++, Perl,…)
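The exchange above can be illustrated with a minimal sensor sketch in Python. It is an assumption-laden illustration, not the Lemon sensor API: only the three message shapes shown on the slide (INI <id> <name>, GET <id>, PUT <id> <value>) come from the source. The function names, the silent handling of unknown requests, and the use of stdin/stdout as the pipe are hypothetical.

```python
import os


def read_load_average():
    """Return the 1-minute load average (Unix only)."""
    return os.getloadavg()[0]


def handle_line(line, metrics, sources):
    """Process one request line from the agent; return a reply line or None.

    metrics maps a metric id (as sent in INI) to a reader function.
    sources maps a metric name to a reader function.
    Unknown or malformed requests are ignored (an assumption of this sketch).
    """
    parts = line.split()
    if len(parts) == 3 and parts[0] == "INI":
        metric_id, name = parts[1], parts[2]
        if name in sources:
            metrics[metric_id] = sources[name]
        return None
    if len(parts) == 2 and parts[0] == "GET":
        reader = metrics.get(parts[1])
        if reader is None:
            return None
        return "PUT %s %.2f" % (parts[1], reader())
    return None


def serve(inp, out, sources=None):
    """Read requests from inp and write replies to out, one per line."""
    if sources is None:
        sources = {"LoadAvg": read_load_average}
    metrics = {}
    for line in inp:
        reply = handle_line(line, metrics, sources)
        if reply is not None:
            out.write(reply + "\n")
            out.flush()  # the agent reads from a pipe, so do not buffer
```

To run this as a stand-alone sensor, a script would end with a call such as serve(sys.stdin, sys.stdout) after importing sys. Keeping the parsing in handle_line makes the protocol logic testable without a real agent.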
Metrics
• Measured metrics (about 255):
– Status: OS, disk DMA, RPM ok?, ethlink,…
– Daemons: sshd, ntpd, syslogd, friod,… alive
– File sizes: /etc/nologin, /afs/cern.ch,…
– Security: sshd md5chksum,…
– Performance: CPU utilization, memory utilization, network bandwidth use,…
– Misc: number of jobs per virtual organization, SMART status, temperature,…
(see the list at http://cern.ch/lemon-status/metric_descriptions.php)
• The status of the MSA can be seen in the /var/log/edg-fmon-agent.log file on each machine (the log file of the edg-fmon-agent daemon)
Lemon at CERN
• Lemon monitors about 2100 computers in 100 clusters
• On average it collects about 70 metrics from each host
• Part of ELFms
• Integrated with the Sure alarm system
• Collects about 1 GB/day
• Integrated with CDB
[Figure: ELFms diagram. Recoverable labels: Node Configuration Management, Node Management.]
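As a rough sanity check on the figures above, the following back-of-the-envelope sketch estimates how much data a single metric on a single host contributes per day. It assumes 1 GB means 10^9 bytes and that the volume is spread evenly across all host/metric pairs, neither of which is stated in the slides.

```python
HOSTS = 2100              # "about 2100 computers"
METRICS_PER_HOST = 70     # "about 70 metrics from each host"
BYTES_PER_DAY = 10**9     # "about 1 GB/day", assuming 1 GB = 10^9 bytes

# Number of distinct (host, metric) pairs being collected.
streams = HOSTS * METRICS_PER_HOST

# Average daily volume per (host, metric) pair, assuming an even spread.
bytes_per_stream_per_day = BYTES_PER_DAY / streams

print(streams)                          # 147000
print(round(bytes_per_stream_per_day))  # 6803
```

So each metric on each host accounts for roughly 7 kB per day on average under these assumptions.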
Sure system
• The Sure sensor compares the values of individual metrics against reference values and raises an alarm when the conditions are met
• Examples:
– Loadavg > 20 – raises Load_high alarm
– # of sshd daemons < 1 – raises sshd_dead alarm
– # of SMART failures in /var/log/messages > 0 – raises smart_failure alarm
• Alarms are sent to the Sure servers
• Operators acknowledge alarms, log them and, if unable to resolve them, notify the responsible person
• Sysadmins receive ITCM tickets – for each alarm there is a procedure describing how to handle it
• Special case – NO_CONTACT alarm
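The threshold checks listed in the examples can be sketched as a small rule table in Python. This is only an illustration of the idea, not Sure's actual implementation or configuration format: the alarm names and reference values come from the slide, while the metric keys (LoadAvg, sshd_count, smart_failures) and the function name are hypothetical.

```python
import operator

# (metric key, comparison, reference value, alarm name)
# Thresholds and alarm names follow the slide's examples;
# the metric keys are hypothetical names chosen for this sketch.
RULES = [
    ("LoadAvg", operator.gt, 20, "Load_high"),
    ("sshd_count", operator.lt, 1, "sshd_dead"),
    ("smart_failures", operator.gt, 0, "smart_failure"),
]


def raised_alarms(sample, rules=RULES):
    """Return the alarm names whose condition holds for one sample.

    sample maps metric keys to their latest measured values.
    Metrics missing from the sample are skipped rather than alarmed on.
    """
    alarms = []
    for metric, compare, reference, alarm in rules:
        if metric in sample and compare(sample[metric], reference):
            alarms.append(alarm)
    return alarms
```

For example, raised_alarms({"LoadAvg": 25.0, "sshd_count": 1, "smart_failures": 0}) yields only Load_high. A host that stops reporting entirely is not covered by this sketch; the slide treats that as the special NO_CONTACT case.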
Web visualization and framework
• LRF pre-processes part of the data from the Monitoring Repository and stores it in RRD files for fast visualization
• It groups the logical units (nodes) into clusters based on:
– CDB [configuration database] definition
– user-defined clusters
– HW type
– Racks
• A PHP-based web interface displays the preprocessed data on demand and, together with CDB and status information, gives a general overview
• Check it at http://cern.ch/lemon-status
Summary
• Lemon provides monitoring information about the computers in the Computer Center at CERN
• Thanks to its integration with Sure (the alarm system), it allows fast and easy identification and repair of problems
• In connection with CDB, it gives an easier overview of services and visualization of their performance
• In connection with Remedy (ITCM), it gives an overview of the problems for a given service