Lemon Monitoring Update Miroslav Siket, German Cancio, Murthy Chandregiri, Rohitashva Sharma, Dennis Waldron CERN-IT/FIO-FS HEPIX Workshop SLAC Oct 10-14, 2005 http://cern.ch/lemon Oct 13th 2005 HEPIX Workshop, SLAC.


Lemon Monitoring Update
Miroslav Siket, German Cancio, Murthy Chandregiri, Rohitashva Sharma, Dennis Waldron
CERN-IT/FIO-FS
HEPIX Workshop
SLAC Oct 10-14, 2005
http://cern.ch/lemon
Oct 13th 2005
HEPIX Workshop, SLAC
Outline
• Where the last HEPIX left off on monitoring
• Usage overview of Lemon
• Security update on Lemon
• Alarm systems and integration with Lemon
• New Lemon development
• Installation and configuration
• Conclusion
Last HEPIX follow-up
• No common monitoring solution in the HEP community:
– Solutions are chosen according to the preference of the service managers at each site (Ganglia, Nagios, ranger, NGOP, Lemon, MonAlisa, …)
– Incompatible infrastructures; difficult to build gateways between them
– Most do not include an alarm system or problem tracking
– Most are not scalable; some do not provide service or other types of monitoring, and lack features, GUIs, …
– Different architectures, based on pull or push mechanisms; no failover or high availability solutions available
• No generally known solution: some sites build their own, some choose from the pool of existing ones
• Multiple development lines, not much information available
• Will Lemon fill the gap?
Lemon Usage Overview
Overview (Sep 2005)
– Usage:
  – CERN:
    – CERN CC (2500 nodes in 100 clusters, in daily operation and tests)
    – AB department at CERN (5 clusters with 100+ nodes, production and test)
    – CMS online (64 nodes, plans for 400+; online cluster)
  – Others:
    – LCG sites (180 sites with 1,100 nodes, multiple clusters per site in production, GridIce)
    – Aachen: plan to use it on their cluster of about 20 nodes
    – S3group (US company): 2 clusters (28 nodes)
    – BARC, India (their clusters; development partner)
    – Evaluating: several other institutes (IN2P3, INFN, …)
– Lemon parts used:
  – Agent, sensors (default set) + own sensors (GridIce) + server (“flat-file” based)
  – Oracle based server: evaluation by IN2P3
  – Web based status pages
Overview cont…
– Platforms: i386, x86_64, ia64, Solaris
– Configuration management:
  – Yaim
  – Quattor
  – Home made tools (scripts); some small sites configure manually
– Problems:
  – A detailed manual on writing sensors, with examples, is needed
  – Restricted functionality of the file based server (enhancements under way)
– Wishes:
  – Configuration GUIs, management
  – Enhanced web interface
  – Additional backend solutions (SQLite, MySQL)
  – Security (LCG requirement)
– Comments:
  – Usually satisfied with the support/response of the Lemon team.
Full listing at http://cern.ch/lemon/doc/usage_overview_2005.htm
Security in Monitoring
• Almost no tool provides authentication and confidentiality of the collected data (some rely on obscuring their protocols or hiding implementation information)
• What are the requirements:
  • Authentication of source (i.e. prevent malicious parties from faking data and unwanted sources from flooding servers, limit the impact of compromised machines, …)
  • Encryption of data (some data could be confidential: failure rates, system status, security statistics, grid user statistics, …)
  • Access restrictions: who can read the data
• The new version of Lemon (coming end of October) comes with SSL (RSA, DSA or X509) based authentication and the possibility of encrypting data between the sensors (sensor agent) and the server
  • Authentication: signed data is transferred over the network and checked on the server against the public key of the machine
  • Encryption: data is encrypted using the public key of the server(s)
  • Access: XML based secure access
Lemon Security
• A secure mechanism to copy the nodes' public keys to the server machine is needed
• At CERN we use SINDES (https://uimon.cern.ch/twiki/bin/view/FIOgroup/SinDes)
• One still needs (who doesn’t) a secure network environment to avoid DoS attacks
• Three default modes of operation to be supported:
  • No encryption, no authentication
  • Authentication
  • Authentication and encryption
[Diagram: Node1, Node2 and Node3 send data to Server1 and Server2. Each node is labelled [rsa_encrypt(s.pub_key)] and rsa_sign(nX.sec_key), with its own secret key n1, n2 or n3. Each server is labelled rsa_verify(metric,nX.pub_key) and [rsa_decrypt(s.sec_key)]. Steps in square brackets are optional.]
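The three modes of operation and the sign/verify and encrypt/decrypt labels in the diagram can be illustrated with a small Python sketch. This is not Lemon code: it uses textbook RSA on tiny keys built from well-known Mersenne primes, with no padding, so it is insecure and only meant to show the data flow. The helper names (`make_key`, `node_send`, `server_receive`), the sample metric string, and the order of operations (encrypt first, then sign what is sent) are assumptions based on one reading of the diagram.

```python
import hashlib

E = 65537  # common public exponent (assumption; the slides do not specify one)


def make_key(p, q):
    """Build a toy RSA key pair from two primes: ((n, e), (n, d))."""
    n = p * q
    d = pow(E, -1, (p - 1) * (q - 1))  # modular inverse, Python 3.8+
    return (n, E), (n, d)


def digest(payload, n):
    """Hash an integer payload and reduce it into the signer's modulus."""
    h = hashlib.sha256(str(payload).encode()).digest()
    return int.from_bytes(h, "big") % n


def node_send(sample, node_sec, server_pub, authenticate, encrypt):
    """Node side: optional rsa_encrypt(s.pub_key), then optional rsa_sign(n.sec_key)."""
    payload = int.from_bytes(sample, "big")
    if encrypt:
        n, e = server_pub
        if payload >= n:
            raise ValueError("sample too long for this toy key")
        payload = pow(payload, e, n)
    signature = None
    if authenticate:
        n, d = node_sec
        signature = pow(digest(payload, n), d, n)
    return payload, signature


def server_receive(payload, signature, node_pub, server_sec, authenticate, encrypt):
    """Server side: optional rsa_verify(n.pub_key), then optional rsa_decrypt(s.sec_key)."""
    if authenticate:
        n, e = node_pub
        if signature is None or pow(signature, e, n) != digest(payload, n):
            raise ValueError("signature check failed")
    if encrypt:
        n, d = server_sec
        payload = pow(payload, d, n)
    return payload.to_bytes((payload.bit_length() + 7) // 8, "big")


# Toy keys from Mersenne primes; real deployments would use proper key sizes.
node_pub, node_sec = make_key(2**61 - 1, 2**89 - 1)
server_pub, server_sec = make_key(2**107 - 1, 2**127 - 1)

sample = b"cpu_load=0.42"  # hypothetical metric sample
for authenticate, encrypt in [(False, False), (True, False), (True, True)]:
    payload, signature = node_send(sample, node_sec, server_pub, authenticate, encrypt)
    received = server_receive(payload, signature, node_pub, server_sec, authenticate, encrypt)
    print(authenticate, encrypt, received)
```

The server only needs each node's public key to reject forged or altered samples, which is why a secure way of distributing those keys (SINDES at CERN) is listed as a prerequisite.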
Alarm Systems and Lemon
Requirements on the alarm system:
– Scalability: 10k+ nodes, 250 alarms, with possible high frequency (100+/min)
– Horizontal alarm reduction, e.g. a “No contact” alarm hides the other alarms on the same node
– Vertical reduction, e.g. 70% of nodes in cluster x show the same alarm -> alarm on x
– History and tracking, priority, sorting
– Associated help, possible actions (opening a ticket,…)
– GUIs for users and for operators, system managers with ACL based access
– Data mining possibility
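The horizontal and vertical reduction requirements above can be sketched as follows. This is a minimal illustration, not the SURE, LASER or Lemon implementation: the alarm names, the data layout (a dict of node to set of alarm names) and the function names are hypothetical, and only the 70% threshold and the "No contact" rule come from the slide.

```python
from collections import defaultdict

NO_CONTACT = "no_contact"   # hypothetical name for the "No contact" alarm
VERTICAL_THRESHOLD = 0.70   # share of cluster nodes taken from the slide's example


def horizontal_reduce(alarms):
    """A node that raised NO_CONTACT reports only that alarm."""
    return {
        node: ({NO_CONTACT} if NO_CONTACT in names else set(names))
        for node, names in alarms.items()
    }


def vertical_reduce(alarms, clusters):
    """Promote an alarm shown by >= 70% of a cluster's nodes to a cluster alarm.

    alarms:   dict node -> set of alarm names
    clusters: dict cluster -> list of member nodes
    Returns (cluster_alarms, remaining_node_alarms).
    """
    cluster_alarms = defaultdict(set)
    node_alarms = {node: set(names) for node, names in alarms.items()}
    for cluster, members in clusters.items():
        counts = defaultdict(int)
        for node in members:
            for name in alarms.get(node, ()):
                counts[name] += 1
        for name, count in counts.items():
            if count / len(members) >= VERTICAL_THRESHOLD:
                cluster_alarms[cluster].add(name)
                for node in members:
                    node_alarms.get(node, set()).discard(name)
    # Drop nodes whose alarms were all absorbed by a cluster alarm.
    remaining = {node: names for node, names in node_alarms.items() if names}
    return dict(cluster_alarms), remaining


raw = {
    "n1": {"high_load"},
    "n2": {"high_load", "disk_full"},
    "n3": {"high_load"},
    "n4": {"no_contact", "high_load"},
}
clusters = {"x": ["n1", "n2", "n3", "n4"]}

cluster_alarms, node_alarms = vertical_reduce(horizontal_reduce(raw), clusters)
print(cluster_alarms)  # alarms raised on whole clusters
print(node_alarms)     # alarms still shown per node
```

Running the horizontal step first means an unreachable node does not count towards the cluster threshold with stale alarms; whether a production system should order the steps this way is a design choice the slides do not settle.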
SURE alarm system:
– Legacy system at the CERN computing centre (12 years)
– Scalability problems, uses old tcl/tk based GUI
– Missing features
– Special Lemon sensor for interfacing with the SURE system
Alarm Systems and Lemon
LASER (The LHC Alarm SERvice Project)
– Provides nice interfaces and most of the required features
– Under evaluation by our team
– Currently there is a problem with the potential number of defined alarms (1M+)
– The data is inserted into LASER from Lemon by the LAG (Lemon Alarm Gateway)
Lemon Alarm System
• Alarm information is also available through the status web pages
• We are considering building a web based alarm system on the existing Lemon infrastructure
• The current implementation includes an overview of alarms with the possibility to choose global/cluster views, …
Enhancements
• Modular configuration (xinit.d style).
  • /etc/lemon/agent/default.conf
  • /etc/lemon/agent/transport/*.conf (location of servers)
  • /etc/lemon/agent/sensors/*.conf (sensor definitions)
  • /etc/lemon/agent/metrics/*.conf (metric setup)
• XML data retrieval.
  • Data retrieval in XML from the application server over https.
  • Authentication and controlled access.
  • Working on C/C++, Java, Perl, Python and PHP APIs.
• Sensors:
  • DB monitoring enhanced (Oracle).
  • Added user, tablespaces, session and wait class monitoring.
  • Exceptions.
  • Adding on behalf and multiple metrics correlation engine.
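To make the modular configuration layout concrete, the sketch below lists the configuration fragments an agent would find under the directories named on the slide. It only discovers files and does not parse them, since the slides do not describe the file format. The function name and the load order (default file first, then each directory in alphabetical order) are assumptions, not documented Lemon behaviour.

```python
from pathlib import Path

# Sub-directories named on the slide, in the order they are listed there.
SUBDIRS = ("transport", "sensors", "metrics")


def agent_config_files(root="/etc/lemon/agent"):
    """Return default.conf followed by every *.conf fragment, if present."""
    root = Path(root)
    files = []
    default = root / "default.conf"
    if default.is_file():
        files.append(default)
    for sub in SUBDIRS:
        directory = root / sub
        if directory.is_dir():
            # Sorted so that the result is deterministic across runs.
            files.extend(sorted(directory.glob("*.conf")))
    return files


for path in agent_config_files():
    print(path)
```

With this kind of layout a package or a configuration tool can add or remove a sensor by dropping a single file into `sensors/`, without editing a shared configuration file.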
Installation and setup
Simplified lemon installation consists of three steps:
1. Server installation
2. Client installation
3. Web interface installation
1. Server installation:
– Install the edg-fabricMonitoring-server rpm (“flat file” server)
– Configure the receiving port in /etc/edg-fmon-server.conf
– Start the server daemon (/sbin/service)
2. Client installation:
– Install the edg-fabricMonitoring-agent rpm (comes with default metric configuration)
– Configure the server and its port in /etc/edg-fmon-agent.conf
– Start the client daemon (/sbin/service)
Installation and Setup (II)
3. Web interface installation
– Install and start apache server and PHP rpm
– Install rrdtool and lrf (lemon rrd framework) rpms
– Configure your clusters in clusters.conf file and start lemonmrd daemon
You are done!
• Possible additional components:
– Computer center synoptic view through an xml file
– Problem tracking system integration (through a php plug-in to your db/application)
– Quattor CDB configuration view, through CDB xml profiles
– Oracle based repository (for very large installations with high scalability and increased functionality)
– Other, new components are easy to add
• View detailed instructions at: http://cern.ch/lemon/doc/installation/installation.html
Conclusions
• The Lemon team is working to satisfy requirements and provide a concise monitoring system
• HEPIX community feedback would be welcome
• An alarm system is emerging
• If you are looking for a solution that is scalable, provides wide functionality and is actively developed, try Lemon
• Check for updates at http://cern.ch/lemon