SRM Monitoring


DPM Monitoring

Wahid Bhimji, University of Edinburgh

Apr-10 Wahid Bhimji – Files access 1

Intro

• New DPM developer Alejandro Álvarez Ayllón is working on new Nagios-based DPM monitoring
• List of probes: https://twiki.cern.ch/twiki/bin/view/EGEE/LCGDMMonitoring
• Bridge to examples running at CERN: http://aalvarez.web.cern.ch/aalvarez/cgi/bridge.py/gt septic/nagios3/
• He's happy to add more probes (very responsive).
• He also wants feedback on sensible WARN / FAIL values
• We can also contribute our own probes

LCGDM plugins

• Check validity of host certificates
  – check_hostcert
  – Warning and critical configurable: days until the certificate expires
• DB password lifetime
  – check_oracle_expiration
  – Warning and critical configurable: days until the password expires
  – Connection string, user and password can be specified
• Disk partitions activity (bytes/s in and out)
  – check_partition_activity
  – No warning or critical criteria.
  – Individual disks can be selected.
• CPU utilization (System/Idle/IOwait/IRQ)
  – check_cpu
  – Warning and critical configurable: upper limit of CPU percentage per category
• Network activity: bytes/s in and out (and error percentage)
  – check_network
  – No warning or critical criteria.
  – Individual interfaces can be selected
• Pool free space plus filesystem status
  – check_dpm_pool
  – Warning and critical configurable: free space per subsystem or per pool. Specified as bytes (with suffixes K,M,G,T,P).
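The warning/critical pattern described for check_dpm_pool can be illustrated with a minimal Python sketch. This is not the actual probe code: the function names `parse_size` and `evaluate_free_space` are hypothetical, and the sketch assumes that the K,M,G,T,P suffixes are binary multiples (1K = 1024 bytes), which the slides do not state. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: suffixes are binary multiples (1K = 1024 bytes).
_SUFFIXES = {"K": 1024 ** 1, "M": 1024 ** 2, "G": 1024 ** 3,
             "T": 1024 ** 4, "P": 1024 ** 5}


def parse_size(text):
    """Convert a threshold such as '500G' or '1024' into a byte count."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIXES:
        return int(float(text[:-1]) * _SUFFIXES[text[-1]])
    return int(text)


def evaluate_free_space(free_bytes, warn, crit):
    """Return (exit_code, message). Less free space is worse, so the
    critical threshold is checked first and must be the smaller value."""
    if free_bytes <= parse_size(crit):
        return CRITICAL, "CRITICAL - %d bytes free" % free_bytes
    if free_bytes <= parse_size(warn):
        return WARNING, "WARNING - %d bytes free" % free_bytes
    return OK, "OK - %d bytes free" % free_bytes
```

A real plugin would print the message and call `sys.exit()` with the returned code, since Nagios reads the state from the process exit status.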

LCGDM probes (cont.)

• Collecting information about disk server activity (network, disk I/O, memory, number of connections), splitting the information between sequential I/O (gridFTP and rfcp) and random I/O (rfio and xroot)
  – check_process can be used for that, except for disk I/O and network usage (apparently a kernel patch is needed for that)
  – Warning and critical configurable: number of instances, % of CPU, % of memory, number of threads, number of connections, number of file descriptors.
  – Individual processes can be selected.
• DPNS ping
  – check_dpns
  – Warning and critical configurable: ping time in milliseconds.
  – Can be used remotely.
• GridFTP
  – check_gridftp
  – No warning criteria. Critical if a file cannot be uploaded or downloaded, or if the comparison is not successful.
  – Can be used remotely.
• Published information
  – check_dpm_infosys
  – No warning criteria. Critical if any of the requested information is not being published.
  – Can be used remotely.
• RFIO
  – check_rfio
  – Everything that applies to the GridFTP probe. Can NOT be executed locally.
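The upload, download and compare logic described for the GridFTP probe might look roughly like the following Python sketch. It is not the check_gridftp implementation: `round_trip_check` and `sha256_of` are hypothetical names, the transfer steps are passed in as callables so that no particular GridFTP client API is assumed, and a SHA-256 digest is used here as one possible way to do the comparison.

```python
import hashlib
import os
import shutil
import tempfile

# Nagios exit codes; this probe style has no WARNING state.
OK, CRITICAL = 0, 2


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def round_trip_check(upload, download, workdir):
    """Write a test file, push it with upload(path), fetch it back with
    download(path), and compare the two copies. Any failure is CRITICAL."""
    source = os.path.join(workdir, "probe_source.dat")
    fetched = os.path.join(workdir, "probe_fetched.dat")
    with open(source, "wb") as handle:
        handle.write(os.urandom(4096))
    try:
        upload(source)
    except Exception as exc:
        return CRITICAL, "CRITICAL - upload failed: %s" % exc
    try:
        download(fetched)
    except Exception as exc:
        return CRITICAL, "CRITICAL - download failed: %s" % exc
    if sha256_of(source) != sha256_of(fetched):
        return CRITICAL, "CRITICAL - downloaded file differs from uploaded file"
    return OK, "OK - file uploaded, downloaded and compared"


if __name__ == "__main__":
    # Stand-in transfers: a local copy plays the role of the remote server.
    workdir = tempfile.mkdtemp()
    remote = os.path.join(workdir, "remote_copy.dat")
    code, message = round_trip_check(
        upload=lambda path: shutil.copyfile(path, remote),
        download=lambda path: shutil.copyfile(remote, path),
        workdir=workdir,
    )
    print(message)
    shutil.rmtree(workdir)
```

In a real probe the two callables would wrap whatever transfer client is available on the monitoring host; the local copy above only exercises the control flow.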

From NAGIOS itself

• DB activity and size – NAGIOS: check_oracle, check_mysql
• Number of processes and threads in use – NAGIOS: check_procs (not threads, though)
• Check if filesystem correctly mounted – NAGIOS: check_disk already does this
• Disk partitions: used and free – NAGIOS: check_disk
• Memory: swap, free and used – NAGIOS: check_swap
• Load average – NAGIOS: check_load

From grid-monitoring

• Check validity of CRLs – crls from org.sam.sec
• Check validity of CAs – check_ca_dist
• Number of sockets used for RFIO and number of sockets used for gridFTP – check_netstat.pl from Nagios Exchange can be used for that.
• Socket count – check_netstat.pl does that and much more.
• Directory size – check_dirsize.sh may be useful.
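To illustrate what counting sockets per service involves, here is a small Python sketch that tallies established TCP connections by local port from `netstat -tn` style output. It is not how check_netstat.pl works internally; `count_sockets_by_port` is a hypothetical name, and the ports used in the sample (2811 for the GridFTP control channel, 5001 for RFIO) are assumed defaults that should be checked against the site configuration. The sample text is made-up input for demonstration only.

```python
def count_sockets_by_port(netstat_output, ports, state="ESTABLISHED"):
    """Count TCP sockets in the given state whose local port is in `ports`.

    `netstat_output` is the text printed by `netstat -tn`, whose data rows
    have the columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State.
    """
    counts = {port: 0 for port in ports}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_address, socket_state = fields[3], fields[5]
        if socket_state != state:
            continue
        try:
            # rsplit handles both '192.0.2.10:2811' and IPv6 ':::2811'.
            port = int(local_address.rsplit(":", 1)[1])
        except (IndexError, ValueError):
            continue
        if port in counts:
            counts[port] += 1
    return counts


# Made-up sample input (documentation address ranges).
SAMPLE = """Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.10:2811         198.51.100.7:40212      ESTABLISHED
tcp        0      0 192.0.2.10:5001         198.51.100.8:51000      ESTABLISHED
tcp        0      0 192.0.2.10:5001         198.51.100.9:51001      TIME_WAIT
tcp        0      0 192.0.2.10:22           198.51.100.9:51002      ESTABLISHED
"""

if __name__ == "__main__":
    # Assumed ports: 2811 (GridFTP control), 5001 (RFIO).
    print(count_sockets_by_port(SAMPLE, [2811, 5001]))
```

Note that counting the GridFTP control port counts sessions, not data channels, which typically use a separate configurable port range.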


Can plot stuff with pnp4nagios
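pnp4nagios builds its graphs from the performance data that a plugin appends to its output after a `|` character, in the form `label=value[UOM];warn;crit;min;max`. The following Python sketch shows one way a probe could format that part; `format_perfdata` and `plugin_output` are hypothetical helper names, not functions from the LCGDM probes.

```python
def format_perfdata(label, value, unit="", warn="", crit="",
                    minimum="", maximum=""):
    """Build one perfdata item: label=value[UOM];warn;crit;min;max.
    Trailing empty fields are dropped; labels with spaces are quoted."""
    fields = ["%s%s" % (value, unit), str(warn), str(crit),
              str(minimum), str(maximum)]
    while fields and fields[-1] == "":
        fields.pop()
    if " " in label:
        label = "'%s'" % label
    return "%s=%s" % (label, ";".join(fields))


def plugin_output(status_text, perfdata_items):
    """Join the human-readable status and the perfdata with a pipe."""
    return "%s | %s" % (status_text, " ".join(perfdata_items))


if __name__ == "__main__":
    items = [
        format_perfdata("cpu_iowait", 12.5, "%", 80, 90),
        format_perfdata("bytes_in", 1048576, "B"),
    ]
    print(plugin_output("OK - CPU within limits", items))
```

Each label becomes a separate data series, so keeping labels stable between runs matters more than the status text.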


Conclusions / Questions

• This is nice. Take a look at the probes and give me or Alex some feedback
• Or try it out yourself. Not tied to any release: http://etics repository.cern.ch:8080/repository/pm/volatile/repo md/name/lcgdm_head_sl5_x86_64_gcc412/index.html
• Do we want to add performance info into this?
  – Like what was in GridPPDPMMonitor
  – Summer student Martin (see DPM Stressing talk) could _maybe_ do some of that