Lemon
Computer Monitoring at CERN
Miroslav Siket
CERN-IT/FIO-FS
13/4/2005 Sysadmin Introduction at CERN

Outline
• Lemon: what is it?
• Structure
• Functionality
• Metrics
• Alarms
• Web visualization

Lemon – LHC Era Monitoring
• Lemon is a software package containing tools for monitoring the status and performance of computers (currently limited to Linux and Solaris)
• It contains the following components:
  – Sensors (they measure individual metrics [values])
  – MSA (Monitoring Sensor Agent)
  – Monitoring Repository (a daemon that receives the metrics)
  – Monitoring Repository Backend (storage)
  – LRF (Lemon RRD tool framework: caching and web presentation tools)
  – Correlation Engines
  – Lemon Client (tool for retrieving data)
  – LAG (Laser Alarm Gateway: tool for passing alarms to the Laser system)
• See http://cern.ch/lemon for more info

Lemon - schema
[Diagram; labels: Nodes, Sensor, Monitoring Agent, TCP/UDP, Monitoring Repository, SQL, Repository backend, SOAP, Correlation Engines, RRDTool / PHP, apache, HTTP, Web browser, Lemon CLI, User]

Sensor (MS) and Sensor Agent (MSA)
• A sensor measures data in response to requests from the MSA
• The MSA receives the data from the sensor through a pipe
• The MSA sends the data to the Monitoring Repository (MR) through a UDP socket
• Typical communication between the two:
  – MSA forks sensor system
  – MSA: INI 1 LoadAvg
  – MSA: GET 1
  – Sensor: PUT 1 0.42
  – MSA: sends UDP packet to MR
• The MSA controls the frequency and status of the individual sensors (several of them)
• You can write sensors yourself (bash, C++, Perl,…)

Metrics
• Measured metrics (about 255):
  – Status: OS, disk DMA, RPM ok?, ethlink,…
  – Daemons alive: sshd, ntpd, syslogd, friod,…
  – Size of files: /etc/nologin, /afs/cern.ch,…
  – Security: sshd md5chksum,…
  – Performance: CPU utilization, memory utilization, network bandwidth use,…
  – Misc: virtual organization number of
jobs, smart status, temperature,… (see the list at http://cern.ch/lemon-status/metric_descriptions.php)
• The status of the MSA can be seen in /var/log/edg-fmon-agent.log on each machine (the log file of the edg-fmon-agent daemon)

Lemon at CERN
• Lemon monitors about 2100 computers in 100 clusters
• On average it collects about 70 metrics from each host
• Part of ELFms
• Integrated with the Sure alarm system
• Collects about 1 GB/day
• Integrated with CDB
[Diagram; labels: Node Configuration Management, Node Management]

Sure system
• The Sure sensor compares the values of individual metrics with reference values and raises an alarm when the conditions are met
• Examples:
  – Loadavg > 20: raises the Load_high alarm
  – # of sshd daemons < 1: raises the sshd_dead alarm
  – # of Smart failures in /var/log/messages > 0: raises the smart_failure alarm
• Alarms are sent to the Sure servers
• Operators acknowledge alarms, log them and, if unable to resolve them, notify the responsible person
• Sysadmins receive ITCM tickets; for each alarm there is a procedure describing how to handle it
• Special case: the NO_CONTACT alarm

Web visualization and framework
• LRF pre-processes part of the data from the Monitoring Repository and stores it in RRD files for fast visualization
• It groups the logical units (nodes) into clusters based on:
  – CDB [configuration database] definition
  – User-defined clusters
  – HW type
  – Racks
• The PHP-based web interface displays the pre-processed data on demand and, together with CDB and status information, gives a general overview
• Check it at http://cern.ch/lemon-status

Summary
• Lemon provides monitoring information about the computers in the Computer Center at CERN
• Thanks to its integration with Sure (the alarm system), it allows fast and easy identification and repair of problems
• In connection with CDB, it allows an easier overview of services and
visualization of their performance
• In connection with Remedy (ITCM), it allows an overview of the problems for a given service
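
Illustration (not part of Lemon): the slides describe a line-based exchange between the MSA and a sensor (INI 1 LoadAvg, GET 1, PUT 1 0.42) and Sure rules such as "Loadavg > 20 raises Load_high". The Python sketch below mimics both ideas in a few lines. It is a toy under stated assumptions: the class and function names (ToySensor, parse_put, Rule, raised_alarms) and the metric name sshd_count are hypothetical, the real protocol and the real Sure configuration format may differ, and no pipe or UDP transport is implemented.

```python
import operator
from typing import Callable, Dict, List, NamedTuple, Tuple


class ToySensor:
    """Hypothetical sensor that answers INI/GET request lines with PUT lines."""

    def __init__(self, readers: Dict[str, Callable[[], float]]) -> None:
        self.readers = readers            # metric name -> function measuring it
        self.active: Dict[str, str] = {}  # metric id -> metric name

    def handle(self, line: str) -> List[str]:
        """Process one request line and return the reply lines (possibly none)."""
        parts = line.split()
        if len(parts) == 3 and parts[0] == "INI":
            _, metric_id, name = parts
            if name not in self.readers:
                raise ValueError(f"unknown metric: {name}")
            self.active[metric_id] = name  # remember which metric this id means
            return []
        if len(parts) == 2 and parts[0] == "GET":
            metric_id = parts[1]
            value = self.readers[self.active[metric_id]]()
            return [f"PUT {metric_id} {value:.2f}"]
        raise ValueError(f"unsupported request: {line!r}")


def parse_put(line: str) -> Tuple[str, float]:
    """Agent side: split a 'PUT <id> <value>' reply into id and value."""
    tag, metric_id, value = line.split()
    if tag != "PUT":
        raise ValueError(f"not a PUT line: {line!r}")
    return metric_id, float(value)


class Rule(NamedTuple):
    """One Sure-style check: compare a metric with a reference value."""
    metric: str
    compare: Callable[[float, float], bool]
    reference: float
    alarm: str


# Rules modelled on the examples in the slides; "sshd_count" is an assumed name.
RULES = [
    Rule("LoadAvg", operator.gt, 20, "Load_high"),
    Rule("sshd_count", operator.lt, 1, "sshd_dead"),
]


def raised_alarms(values: Dict[str, float], rules: List[Rule]) -> List[str]:
    """Return the alarms whose condition holds for the given metric values."""
    return [
        rule.alarm
        for rule in rules
        if rule.metric in values and rule.compare(values[rule.metric], rule.reference)
    ]


if __name__ == "__main__":
    # A fixed reader stands in for a real load-average measurement.
    sensor = ToySensor({"LoadAvg": lambda: 0.42})
    sensor.handle("INI 1 LoadAvg")
    reply = sensor.handle("GET 1")[0]
    print(reply)  # PUT 1 0.42
    _, load = parse_put(reply)
    print(raised_alarms({"LoadAvg": load, "sshd_count": 1}, RULES))  # []
```

In a real deployment the reader function would measure the machine (for example the load average), and the agent would forward the parsed value to the Monitoring Repository instead of printing it.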