Lemon Monitoring Update
Miroslav Siket, German Cancio, Murthy Chandregiri, Rohitashva Sharma, Dennis Waldron
CERN-IT/FIO-FS
HEPIX Workshop, SLAC, Oct 10-14, 2005
http://cern.ch/lemon
Oct 13th 2005, HEPIX Workshop, SLAC

Outline
• Where the last HEPIX left off on monitoring
• Usage overview of Lemon
• Security update on Lemon
• Alarm systems and integration with Lemon
• Lemon new development
• Installation and configuration
• Conclusion

Last HEPIX followup
• No common monitoring solution in the HEP community:
  – Solutions are chosen by the preference of service managers at each site (Ganglia, Nagios, ranger, NGOP, Lemon, MonALISA,…)
  – Infrastructures are not compatible, and gateways between them are difficult to build
  – Most do not include an alarm system or problem tracking
  – Most are not scalable; some do not provide service or other types of monitoring, lack features, have no GUIs,…
  – Different architectures, based on pull or push mechanisms, with no failover or high availability solutions available
• No generally known solution: some sites build their own, some choose from the pool of existing ones
• Multiple development lines, not much information available
• Will Lemon fill the gap?

Lemon Usage Overview
Overview (Sep 2005)
– Usage:
  – CERN:
    – CERN CC (2500 nodes with 100 clusters, in daily operation and tests)
    – AB department at CERN (5 clusters with 100+ nodes, production and test)
    – CMS online (64 nodes, plans for 400+, online cluster)
  – Others:
    – LCG sites (180 sites with 1,100 nodes, multiple clusters per site in production, GridIce)
    – Aachen: plan to use it on their cluster, about 20 nodes
    – S3group (US company): 2 clusters (28 nodes)
    – BARC, India (their clusters; development partner)
    – Evaluating: several other institutes (IN2P3, INFN,…)
– Lemon parts used:
  – Agent, sensors (default set) + own sensors (GridIce) + server (“flat-file” based)
  – Oracle based: evaluation by IN2P3
  – Web based status pages

Overview cont…
– Platforms: i386, x86_64, ia64, Solaris
– Configuration management:
  – Yaim
  – Quattor
  – Home made tools (scripts), and manually at some small sites
– Problems:
  – A detailed manual on writing sensors, with examples, is needed
  – Restricted functionality of the file based server (enhancements under way)
– Wishes:
  – Configuration GUIs, management
  – Enhanced web interface
  – Additional backend solutions (SQLite, MySQL)
  – Security (LCG requirement)
– Comments:
  – Users are usually satisfied with the support/response of the Lemon team.
Full listing at http://cern.ch/lemon/doc/usage_overview_2005.htm

Security in Monitoring
• Almost no tool comes with authentication and confidentiality of collected data (some approximate it by obscuring their protocols or hiding implementation information)
• What are the requirements:
  • Authentication of source (i.e. prevent malicious parties from faking data and unwanted sources from flooding servers, limit the impact of compromised machines,…)
  • Encryption of data (some data could be confidential: failure rates, system status, security statistics, grid user statistics,…)
  • Access restrictions: who can read the data
• The new version of Lemon (coming end of October) comes with SSL (RSA, DSA or X509) based authentication and optional encryption of data between the sensors (sensor agent) and the server
  • Authentication: signed data is transferred over the network and checked on the server against the public key of the machine
  • Encryption: data is encrypted using the public key of the server(s)
  • Access: XML based secure access

Lemon Security
• A secure mechanism to copy the nodes' public keys to the server machine is needed
  • At CERN we use SINDES (https://uimon.cern.ch/twiki/bin/view/FIOgroup/SinDes)
• One still needs (who doesn't?) a secure network environment to avoid DoS attacks
• Three default modes of operation to be supported:
  • No encryption, no authentication
  • Authentication
  • Authentication and encryption
[Diagram: Node1, Node2 and Node3 send metrics to Server1 and Server2. Each node NodeN applies [rsa_encrypt(s.pub_key)] and rsa_sign(nN.sec_key); each server applies rsa_verify(metric, nN.pub_key) and [rsa_decrypt(s.sec_key)].]

Alarm Systems and Lemon
Requirements on the alarm system:
– Scalability: 10k+ nodes, 250 alarms, with possible high frequency (100+/min)
– Horizontal alarm reduction, e.g. a “No contact” alarm hides other alarms on the same node
– Vertical reduction, e.g. 70% of nodes in cluster x show the same alarm -> alarm on x
– History and tracking, priority, sorting
– Associated help, possible actions (opening a ticket,…)
– GUIs for users and for operators, system managers with ACL based access
– Data mining possibility
SURE alarm system:
– Legacy system at the CERN computing centre (12 years)
– Scalability problems, uses an old tcl/tk based GUI
– Missing features
– Special Lemon sensor for interfacing with the SURE system

Alarm Systems and Lemon
LASER (The LHC Alarm SERvice Project)
– Provides nice interfaces and most of the features
– Under evaluation by our team
– Currently there is a problem with the potential number of defined alarms (1M+)
– Data is inserted into LASER from Lemon by LAG (Lemon Alarm Gateway)

Lemon Alarm System
• Alarm information is also available through the status web pages
• We are considering building a web based alarm system on the existing Lemon infrastructure
• The current implementation includes an overview of alarms with the possibility to choose global/cluster views,…

Enhancements
• Modular configuration (xinit.d style).
  • /etc/lemon/agent/default.conf.
  • /etc/lemon/agent/transport/*.conf (location of servers).
  • /etc/lemon/agent/sensors/*.conf (sensor definitions).
  • /etc/lemon/agent/metrics/*.conf (metric setup).
• XML data retrieval.
  • Data retrieval in XML from the application server over https.
  • Authentication and controlled access.
  • Working on C/C++, Java, Perl, Python and PHP APIs.
• Sensors:
  • DB monitoring enhanced (Oracle).
  • Added user, tablespaces, session and wait class monitoring.
• Exceptions.
  • Adding on behalf and multiple metrics correlation engine.

Installation and setup
Simplified Lemon installation consists of three steps:
1. Server installation
2. Client installation
3. Web interface installation

1. Server installation:
  – Install the edg-fabricMonitoring-server rpm (“flat file” server)
  – Configure the receiving port in /etc/edg-fmon-server.conf
  – Start the server daemon (/sbin/service)
2. Client installation:
  – Install the edg-fabricMonitoring-agent rpm (comes with a default metric configuration)
  – Configure the server and its port in /etc/edg-fmon-agent.conf
  – Start the client daemon (/sbin/service)

Installation and Setup (II)
3. Web interface installation:
  – Install and start the apache server and the PHP rpm
  – Install the rrdtool and lrf (lemon rrd framework) rpms
  – Configure your clusters in the clusters.conf file and start the lemonmrd daemon
You are done!
• Possible additional components:
  – Computer center synoptic view through an xml file
  – Problem tracking system integration (through a php plug-in to your db/application)
  – Quattor CDB configuration view, through CDB xml profiles
  – Oracle based repository (for very large installations with high scalability and increased functionality)
  – Other, new components are easy to add
• View detailed instructions at: http://cern.ch/lemon/doc/installation/installation.html

Conclusions
• The Lemon team is working to satisfy requirements and provide a concise monitoring system
• HEPIX community feedback would be welcome
• The alarm system is emerging
• If you are looking for a solution that is scalable, provides wide functionality and is actively developed, try Lemon
• Check for updates at http://cern.ch/lemon
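
Sketch: horizontal and vertical alarm reduction
The two reduction rules listed under the alarm system requirements (a “No contact” alarm hides other alarms on the same node; 70% of nodes in a cluster showing the same alarm becomes one alarm on the cluster) can be illustrated with a few lines of Python. This is only a minimal sketch and not Lemon, SURE or LASER code: the function names, the alarm name "no_contact" and the dictionary layout are assumptions made for the example.

```python
from collections import defaultdict

NO_CONTACT = "no_contact"    # hypothetical alarm name, for illustration only
VERTICAL_THRESHOLD = 0.70    # taken from the "70% of nodes" example


def reduce_horizontal(alarms_by_node):
    """Horizontal reduction: a no-contact alarm hides the node's other alarms."""
    reduced = {}
    for node, alarms in alarms_by_node.items():
        reduced[node] = {NO_CONTACT} if NO_CONTACT in alarms else set(alarms)
    return reduced


def reduce_vertical(alarms_by_node, cluster_of, threshold=VERTICAL_THRESHOLD):
    """Vertical reduction: promote an alarm to the cluster when the fraction
    of the cluster's nodes showing it reaches the threshold.

    Returns (cluster_alarms, remaining_node_alarms).
    """
    nodes_in_cluster = defaultdict(list)
    for node, cluster in cluster_of.items():
        nodes_in_cluster[cluster].append(node)

    cluster_alarms = defaultdict(set)
    remaining = {node: set(alarms) for node, alarms in alarms_by_node.items()}

    for cluster, nodes in nodes_in_cluster.items():
        # Count how many nodes of this cluster show each alarm.
        counts = defaultdict(int)
        for node in nodes:
            for alarm in alarms_by_node.get(node, ()):
                counts[alarm] += 1
        for alarm, count in counts.items():
            if count / len(nodes) >= threshold:
                cluster_alarms[cluster].add(alarm)
                # The per-node copies are now covered by the cluster alarm.
                for node in nodes:
                    if node in remaining:
                        remaining[node].discard(alarm)
    return dict(cluster_alarms), remaining
```

Applying reduce_horizontal first and reduce_vertical second keeps unreachable nodes from counting toward alarms they can no longer report; whether that ordering matches any production alarm system is not stated in the slides.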
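
Sketch: signing on the node, verifying on the server
The authentication mode described in the security slides (the node signs its data with its secret key, the server checks the signature against the public key of that machine) can be sketched as below. This is a toy, assumption-laden illustration: it uses unpadded textbook RSA in pure Python, which is not secure and is not how Lemon implements it; the names rsa_sign and rsa_verify echo the slide diagram, but their signatures, make_keypair, accept_metric and the key registry are hypothetical. A real deployment would rely on a vetted SSL library.

```python
import hashlib


def make_keypair(p, q, e=65537):
    """Build a textbook RSA key pair from two primes (toy example only)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)          # modular inverse, Python 3.8+
    return (n, e), (n, d)        # (public key, secret key)


def _digest(message, n):
    """Hash the message and reduce it into the range of the modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n


def rsa_sign(message, sec_key):
    """Node side: sign the metric bytes with the node's secret key."""
    n, d = sec_key
    return pow(_digest(message, n), d, n)


def rsa_verify(message, signature, pub_key):
    """Server side: check the signature with the node's public key."""
    n, e = pub_key
    return pow(signature, e, n) == _digest(message, n)


def accept_metric(node, message, signature, known_pub_keys):
    """Accept a metric only if the node is known and its signature checks out.

    known_pub_keys is a hypothetical registry {node name: public key},
    standing in for keys copied to the server by a secure mechanism.
    """
    pub_key = known_pub_keys.get(node)
    if pub_key is None:
        return False
    return rsa_verify(message, signature, pub_key)
```

A short usage example, with two Mersenne primes chosen only for convenience: `pub, sec = make_keypair(2**61 - 1, 2**89 - 1)`, then `sig = rsa_sign(b"load=0.42", sec)` on the node and `accept_metric("node1", b"load=0.42", sig, {"node1": pub})` on the server. The optional encryption step with the server's public key is left out of this sketch.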