UKI-SouthGrid Overview and Oxford Status Report
Pete Gronbech, SouthGrid Technical Coordinator
HEPSYSMAN, RAL, 30th June 2009
SouthGrid Tier 2
• The UK is split into four geographically distributed Tier 2 centres.
• SouthGrid comprises all the southern sites not in London.
• New sites are likely to join.

UK Tier 2 reported CPU – Historical View to present
[Chart: CPU delivered by the four UK Tier 2s (UK-London-Tier2, UK-NorthGrid, UK-ScotGrid, UK-SouthGrid), July 2008 to June 2009, in kSPECint2000 hours.]

SouthGrid Sites Accounting as reported by APEL
[Chart: per-site SouthGrid accounting (JET, BHAM, BRIS, CAM, OX, RALPPD), May 2008 to June 2009, in kSPECint2000 hours.]

Site Upgrades in the last 6 months
• RALPPD: increase of 960 cores (1716 kSI2k) + 380 TB.
• Cambridge: 32 cores (83 kSI2k) + 20 TB.
• Birmingham: 64 cores on the PP cluster and 192 cores on the HPC cluster, adding ~535 kSI2k.
• Bristol: original cluster replaced by new quad-core systems (24 cores), plus an increased share of the HPC cluster: 125 kSI2k + 44 TB.
• Oxford: extra 208 cores, 540 kSI2k + 60 TB.
• JET: extra 120 cores, 240 kSI2k.

New Totals, Q2 2009
Site         GridPP CPU (kSI2k)   Storage (TB)   % of MoU CPU   % of MoU Disk
EDFA-JET       483                   1.5              -               -
Birmingham     728                  90              244.30%        105.88%
Bristol        192                  55              100.52%        177.42%
Cambridge      455                  60              245.95%        136.36%
Oxford         972                 160              309.55%        213.33%
RALPPD        2743                 633              208.91%        207.54%
Totals        5573                 999.5            242.20%        185.09%

Site Setup Summary
Site         Cluster(s)                      Installation Method                          Batch System
Birmingham   Dedicated & shared HPC          PXE, Kickstart, CFEngine; tarball for HPC    Torque
Bristol      Small dedicated & shared HPC    PXE, Kickstart, CFEngine; tarball for HPC    Torque
Cambridge    Dedicated                       PXE, Kickstart, custom scripts               Condor
JET          Dedicated                       Kickstart, custom scripts                    Torque
Oxford       Dedicated                       PXE, Kickstart, CFEngine                     Torque
RAL PPD      Dedicated                       PXE, Kickstart, CFEngine                     Torque
(A minimal example of submitting a job to one of these Torque batch systems is sketched below.)
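Most of the SouthGrid sites in the table above run Torque as their batch system. Purely as a hedged illustration (not taken from the talk), the sketch below submits a trivial Torque job with qsub from Python; the queue name and resource limits are hypothetical placeholders and would differ from site to site.

#!/usr/bin/env python
# Sketch only: submit a minimal job to a Torque batch system via qsub.
# Assumptions (not from the talk): qsub is on PATH and a queue named
# "long" exists; both are illustrative placeholders.
import subprocess

JOB_SCRIPT = """#!/bin/bash
#PBS -N hello-torque
#PBS -q long
#PBS -l nodes=1:ppn=1,walltime=00:05:00
echo "Running on $(hostname)"
"""

def submit(script_text):
    """Pipe a PBS job script into qsub and return the job ID it prints."""
    result = subprocess.run(
        ["qsub"], input=script_text, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print("Submitted job:", submit(JOB_SCRIPT))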
New Staff
• Jan 2009: Kashif Mohammad, Deputy Technical Coordinator, based at Oxford.
• May 2009: Chris Curtis, SouthGrid hardware support, based at Birmingham.
• June 2009: Bob Cregan, HPC support at Bristol.

Oxford Site Report

Oxford Central Physics
• Centrally supported Windows XP desktops (~500).
• Physics-wide Exchange server for email, with BES to support BlackBerries.
• Network services for Mac OS X:
  – Astro converted entirely to Central Physics IT services (120 OS X systems).
  – Started experimenting with Xgrid.
• Media services: photocopiers/printers replaced, at much lower cost than other departmental printers.
• Network:
  – The network is too large; we are looking to divide it into smaller pieces for better management and easier scaling to higher performance.
  – Wireless: eduroam introduced on all Physics WLAN base stations.
  – Identified problems with a 3Com 4200G switch which caused a few connections to run very slowly; now fixed.
  – Improved the network core and computer room with redundant pairs of 3Com 5500 switches.

Oxford Tier 2 Report – Major Upgrade 2007
• The lack of a decent computer room with adequate power and A/C held back upgrading our 2004 kit until autumn 2007.
• 11 systems (22 servers, 44 CPUs, 176 cores). Intel 5345 Clovertown CPUs provide ~430 kSI2k, with 16 GB of memory per server. Each server has a 500 GB SATA HD and an IPMI remote KVM card.
• 11 storage servers, each providing 9 TB usable after RAID 6 (~99 TB in total), using 3ware 9650-16ML controllers.
• Two racks, 4 redundant management nodes, 4 APC 7953 PDUs, 4 UPSs.

Oxford Physics now has two Computer Rooms
• Oxford's grid cluster was initially housed in the departmental computer room in late 2007.
• It later moved to the new shared University room at Begbroke (5 miles up the road).

Oxford Upgrade 2008 – more of the same, but better!
• 13 systems (26 servers, 52 CPUs, 208 cores). Intel 5420 Harpertown CPUs provide ~540 kSI2k, with 16 GB of low-voltage FB-DIMM memory per server. Each server has a 500 GB SATA HD.
• 3 storage servers, each providing 20 TB usable after RAID 6 (~60 TB in total), using Areca controllers.
• 3 3Com 5500 switches with backplane interconnects.

Grid Cluster setup
• SL5 test nodes are available.
[Diagram: cluster layout – CEs t2ce02, t2ce04 and t2ce05 in front of Torque servers t2torque and t2torque02; worker nodes T2wn40 to T2wn85 run gLite 3.1 on SL4, while T2wn86 and T2wn87 are the SL5 test nodes running gLite 3.2.]

Nov 2008 Upgrade to the Oxford Grid Cluster at Begbroke Science Park

Local PP Cluster (Tier 3)
• Nov 2008 upgrade, using the same hardware as the grid cluster:
  – 3 storage nodes
  – 8 twins
  – 3 3Com 5500 switches with backplane interconnects
  – 100 Mb/s switches used for the management cards (IPMI and RAID)
  – APC rack; very easy to mount the APC PDUs
• Still running SL4, but a test SL5 system is available for users to try; we are ready to switch over when we have to.
• Lustre FS not yet implemented, due to lack of time.

Electrical Power consumption
• The newer generation of Intel quad-core CPUs takes less power.
• Tested by running one cpuburn process per core on both sides of a twin and killing one process every 5 minutes.
              Busy     Idle
Intel 5345    645 W    410 W
Intel 5420    490 W    320 W

Electricity Costs*
• We have to pay for the electricity used at the Begbroke computer room.
• The electricity to run the old (4-year-old) Dell nodes (~79 kSI2k) costs ~£8600 per year.
• The replacement cost in new twins is ~£6600, with an electricity cost of ~£1100 per year.
• So the saving is ~£900 in the first year and ~£7500 per year thereafter (the arithmetic is worked through in the sketch after the next slide).
• The conclusion is that it is not economically viable to run kit that is more than 4 years old.
* Jan 2008 figures

IT related power saving
• Shutting down desktops when idle:
  – Machines must be genuinely idle: logged off, no shared printers or disks, no remote access, etc.
  – 140 machines are regularly shut down.
  – They are powered up automatically early in the morning, using Wake-on-LAN, to apply patches and be ready for users (see the sketch below).
• Old cluster nodes removed or replaced with more efficient servers.
• Virtualisation reduces the number of servers and the power used.
• Computer room temperatures raised from 19°C to 23–25°C to improve A/C efficiency.
• Windows Server 2008 allows control of the new power-saving options on more modern desktop systems.
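The saving quoted on the Electricity Costs slide follows directly from the figures given there. A minimal worked sketch, using only the numbers from the slide (Jan 2008 electricity prices):

# Worked version of the Electricity Costs arithmetic, using only the
# figures quoted on the slide (Jan 2008 prices).
old_electricity_per_year = 8600   # GBP/year to power the ~79 kSI2k of old Dell nodes
new_twins_purchase = 6600         # GBP one-off cost of the replacement twin servers
new_electricity_per_year = 1100   # GBP/year to power the replacement twins

first_year_saving = old_electricity_per_year - (new_twins_purchase + new_electricity_per_year)
later_yearly_saving = old_electricity_per_year - new_electricity_per_year

print(f"First-year saving: ~£{first_year_saving}")       # ~£900
print(f"Saving thereafter: ~£{later_yearly_saving}/year")  # ~£7500/year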
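The automatic morning power-up on the IT related power saving slide uses Wake-on-LAN, but the slide does not show the mechanism. As a hedged sketch only: a standard WoL magic packet is 6 bytes of 0xFF followed by the target MAC address repeated 16 times, sent as a UDP broadcast. The MAC and broadcast address below are hypothetical placeholders, not values from the talk.

# Sketch only: send a Wake-on-LAN magic packet so a shut-down desktop
# powers up for overnight patching. MAC and broadcast address are placeholders.
import socket

def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send a WoL magic packet: 6 x 0xFF followed by the MAC repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    packet = b"\xff" * 6 + mac_bytes * 16
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))

if __name__ == "__main__":
    wake("00:11:22:33:44:55")  # placeholder MAC of a desktop to wake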
CPU Benchmarking: HEPSPEC06
hostname    CPU type          memory   cores   HEPSPEC06   HEPSPEC06/core
node10      2.4 GHz Xeon      4 GB     2        7          3.5
node10      2.4 GHz Xeon      4 GB     2        6.96       3.48
t2wn61      E5345 2.33 GHz    16 GB    8       57.74       7.22
pplxwn16    E5420 2.5 GHz     16 GB    8       64.88       8.11
pplxint3    E5420 2.5 GHz     16 GB    8       64.71       8.09
These figures closely match those published at http://www.infn.it/CCR/server/cpu2006-hepspec06-table.jpg
A Nehalem Dell server has just arrived for testing.

Cluster Usage at Oxford
• CPU hours are shared roughly equally between LHCb and ATLAS: ATLAS runs many short jobs, LHCb runs longer ones.
• Cluster occupancy is approximately 70%, so there is still room for more jobs.
• There is a local contribution to ATLAS MC storage.

Oxford recently had its network link rate capped to 100 Mb/s
• The cap was a result of continuous 300–350 Mb/s traffic caused by CMS commissioning stress testing. As it happens, the test completed at the same time as we were capped, so we passed the test, and current normal use is not expected to be this high.
• Oxford's JANET link is actually 2 × 1 Gb/s links, which had become saturated.
• The short-term solution is to rate-cap only the JANET traffic, to 200 Mb/s, which does not impact normal working (for now); all other on-site traffic remains at 1 Gb/s.
• The long-term plan is to upgrade the JANET link to 10 Gb/s within the year.

gridppnagios
• We have set up a Nagios monitoring site for the UK, which several other sites use to get advance warning of failures (a sketch of the plugin convention it relies on follows below).
• https://gridppnagios.physics.ox.ac.uk/nagios/
• https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringInfo
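gridppnagios is built on the standard grid service monitoring probes described in the twiki link above. Purely as an illustration of the Nagios plugin convention such probes follow (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, with a one-line status message on stdout), here is a hedged sketch of a trivial check that a service port answers. The host and port are placeholders; this is not one of the actual gridppnagios probes.

#!/usr/bin/env python
# Illustrative Nagios-style check: report whether a TCP port answers,
# using the standard exit-code convention (0 OK, 2 CRITICAL, 3 UNKNOWN).
# The host and port below are placeholders, not real gridppnagios targets.
import socket
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_tcp(host: str, port: int, timeout: float = 5.0) -> int:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print(f"OK - {host}:{port} is accepting connections")
            return OK
    except socket.timeout:
        print(f"CRITICAL - {host}:{port} timed out after {timeout}s")
        return CRITICAL
    except OSError as err:
        print(f"CRITICAL - {host}:{port} unreachable ({err})")
        return CRITICAL

if __name__ == "__main__":
    # Placeholder target; a real deployment would take these from the command line.
    sys.exit(check_tcp("example.org", 443))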