UKI-SouthGrid Overview
and Oxford Status Report
Pete Gronbech
SouthGrid Technical Coordinator
HEPSYSMAN RAL
30th June 2009
SouthGrid Tier 2
• The UK is split into 4 geographically distributed Tier 2 centres
• SouthGrid comprises all the southern sites not in London
• New sites likely to join
UK Tier 2 reported CPU – Historical View to present
[Chart: monthly CPU delivered by UK-London-Tier2, UK-NorthGrid, UK-ScotGrid and UK-SouthGrid, Jul-08 to Jun-09; y-axis in kSI2K (SPECint2000) hours]
SouthGrid Sites – Accounting as reported by APEL
[Chart: monthly CPU delivered by JET, BHAM, BRIS, CAM, OX and RALPPD, May-08 to Jun-09; y-axis in kSI2K (SPECint2000) hours]
Site Upgrades in the last 6 months
• RALPPD: increase of 960 cores (1716 kSI2K) + 380 TB
• Cambridge: 32 cores (83 kSI2K) + 20 TB
• Birmingham: 64 cores on the PP cluster and 192 cores on the HPC cluster, which add ~535 kSI2K
• Bristol: original cluster replaced by new quad-core systems (24 cores) plus an increased share of the HPC cluster, adding 125 kSI2K + 44 TB
• Oxford: extra 208 cores (540 kSI2K) + 60 TB
• JET: extra 120 cores (240 kSI2K)
New Total Q209 – SouthGrid / GridPP

Site         CPU (kSI2K)   Storage (TB)   % of MoU CPU   % of MoU Disk
EDFA-JET     483           1.5            -              -
Birmingham   728           90             244.30%        105.88%
Bristol      192           55             100.52%        177.42%
Cambridge    455           60             245.95%        136.36%
Oxford       972           160            309.55%        213.33%
RALPPD       2743          633            208.91%        207.54%
Totals       5573          999.5          242.20%        185.09%
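A quick consistency check of the Totals row (a minimal Python sketch using only the numbers in the table above):

    # Verify that the Totals row is the sum of the per-site figures.
    cpu_ksi2k = {"EDFA-JET": 483, "Birmingham": 728, "Bristol": 192,
                 "Cambridge": 455, "Oxford": 972, "RALPPD": 2743}
    storage_tb = {"EDFA-JET": 1.5, "Birmingham": 90, "Bristol": 55,
                  "Cambridge": 60, "Oxford": 160, "RALPPD": 633}
    print(sum(cpu_ksi2k.values()))    # 5573 kSI2K
    print(sum(storage_tb.values()))   # 999.5 TB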
Site Setup Summary

Site         Cluster/s                      Installation Method                         Batch System
Birmingham   Dedicated & Shared HPC         PXE, Kickstart, CFEngine; tarball for HPC   Torque
Bristol      Small Dedicated & Shared HPC   PXE, Kickstart, CFEngine; tarball for HPC   Torque
Cambridge    Dedicated                      PXE, Kickstart, custom scripts              Condor
JET          Dedicated                      Kickstart, custom scripts                   Torque
Oxford       Dedicated                      PXE, Kickstart, CFEngine                    Torque
RAL PPD      Dedicated                      PXE, Kickstart, CFEngine                    Torque
New Staff
• Jan 2009: Kashif Mohammad, Deputy Technical Coordinator, based at Oxford
• May 2009: Chris Curtis, SouthGrid hardware support, based at Birmingham
• June 2009: Bob Cregan, HPC support at Bristol
Oxford Site Report
Oxford Central Physics
• Centrally supported Windows XP desktops (~500)
• Physics wide Exchange Server for email
– BES to support Blackberries
• Network services for Mac OS X
– Astro converted entirely to Central Physics IT services (120 OS X systems)
– Started experimenting with Xgrid
• Media services
– Photocopiers/printers replaced – much lower costs than other departmental
printers.
• Network
– Network is too large. Looking to divide into smaller pieces – better
management and easier to scale to higher performance.
– Wireless: introduced eduroam on all Physics WLAN base stations.
– Identified problems with a 3Com 4200G switch which caused a few connections to run very slowly. Now fixed.
– Improved the network core and computer room with redundant pairs of 3Com 5500 switches.
Oxford Tier 2 Report
Major Upgrade 2007
• Lack of a decent computer room with adequate power and A/C held back upgrading our 2004 kit until Autumn 07.
• 11 systems, 22 servers, 44 CPUs, 176 cores. Intel 5345 Clovertown CPUs provide ~430 kSI2K; 16 GB memory for each server. Each server has a 500 GB SATA HD and an IPMI remote KVM card.
• 11 servers each providing 9 TB usable storage after RAID 6, total ~99 TB, with 3ware 9650-16ML controllers.
• Two racks, 4 redundant management nodes, 4 APC 7953 PDUs, 4 UPSs.
Oxford Physics now has two
Computer Rooms
• Oxford's Grid Cluster was initially housed in the departmental computer room in late 2007
• It was later moved to the new shared University room at Begbroke (5 miles up the road)
Oxford Upgrade 2008
More of the same but better!
• 13 systems, 26 servers, 52 CPUs, 208 cores. Intel 5420 Harpertown CPUs provide ~540 kSI2K; 16 GB low-voltage FB-DIMM memory for each server. Each server has a 500 GB SATA HD.
• 3 servers each providing 20 TB usable storage after RAID 6, total ~60 TB, with Areca controllers.
• 3 3Com 5500 switches with backplane interconnects.
Grid Cluster setup
SL5 test nodes available
[Diagram: CEs t2ce02, t2ce04 and t2ce05 with Torque servers t2torque and t2torque02; worker nodes t2wn40 to t2wn86 run gLite 3.1 on SL4, while the test worker node t2wn87 runs gLite 3.2 on SL5]
Nov 2008 Upgrade to the Oxford Grid
Cluster at Begbroke Science Park
Local PP Cluster (tier 3)
• Nov 2008 upgrade: same h/w as the Grid cluster
– 3 storage nodes
– 8 twins
– 3 3Com 5500 switches with backplane interconnects
– 100 Mb/s switches used for the management cards (IPMI and RAID)
– APC rack; very easy to mount the APC PDUs
• Still running SL4 but have a test SL5 system for users to try.
We are ready to switch over when we have to.
• Lustre FS not yet implemented due to lack of time.
Electrical Power consumption
• Newer-generation Intel quad-core CPUs take less power
• Tested using one cpuburn process per core on both sides of a twin, killing one process every 5 minutes (a sketch of such a test script follows below).
CPU          Busy    Idle
Intel 5345   645 W   410 W
Intel 5420   490 W   320 W
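The measurement procedure described above could be scripted roughly as follows; this is a minimal Python sketch only, assuming the cpuburn package's burnP6 binary is installed (the slide does not say how the test was actually driven):

    # Start one cpuburn process per core, then kill one every 5 minutes
    # so the power meter can be read at each load level.
    import multiprocessing
    import subprocess
    import time

    cores = multiprocessing.cpu_count()      # 8 per side of a twin here
    procs = [subprocess.Popen(["burnP6"]) for _ in range(cores)]

    try:
        while procs:
            time.sleep(300)                  # hold this load level for 5 minutes
            p = procs.pop()
            p.terminate()                    # drop the load by one core
            p.wait()
    finally:
        for p in procs:                      # clean up if interrupted
            p.terminate()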
Electricity Costs*
• We have to pay for the electricity used at the
Begbroke Computer Room:
• The cost in electricity to run the old (4-year-old) Dell nodes is ~£8600 per year (~79 kSI2K).
• The replacement cost in new twins is ~£6600, with an electricity cost of ~£1100 per year.
• So a saving of ~£900 in the first year and ~£7500 per year thereafter (the arithmetic is sketched below).
• The conclusion is that it is not economically viable to run kit older than 4 years.
* Jan 2008 figures
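The payback arithmetic behind those figures, as a minimal Python sketch using only the rounded numbers quoted above:

    # All values are the rounded Jan 2008 figures from the slide.
    old_electricity = 8600    # £/year to power the old Dell nodes
    new_capital     = 6600    # £ one-off cost of the replacement twins
    new_electricity = 1100    # £/year to power the replacement twins

    first_year_saving = old_electricity - (new_capital + new_electricity)
    ongoing_saving    = old_electricity - new_electricity

    print(first_year_saving)  # ~£900
    print(ongoing_saving)     # ~£7500 per year thereafter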
IT related power saving
• Shutting down desktops when idle
– Must be idle: logged off, no shared printers or disks, no remote access, etc.
– 140 machines regularly shut down
– Automatic power-up early in the morning to apply patches and get ready for users, using Wake-On-LAN (a sketch of the wake-up packet follows after this list)
• Old cluster nodes removed/replaced with more efficient servers
• Virtualisation reduces the number of servers and the power used
• Computer room temperatures raised to improve A/C efficiency (from 19C to 23-25C)
• Windows Server 2008 allows control of new power-saving options on more modern desktop systems
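The overnight power-up relies on standard Wake-On-LAN magic packets. A minimal Python sketch of sending one (the MAC address shown is only a placeholder):

    # A Wake-On-LAN "magic packet" is 6 bytes of 0xFF followed by the
    # target MAC address repeated 16 times, broadcast over UDP.
    import socket

    def wake(mac, broadcast="255.255.255.255", port=9):
        mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
        packet = b"\xff" * 6 + mac_bytes * 16
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            s.sendto(packet, (broadcast, port))

    wake("00:11:22:33:44:55")   # placeholder MAC address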
CPU Benchmarking HEPSPEC06
hostname   CPU type         memory   cores   HEP-SPEC06   HEP-SPEC06/core
node10     2.4 GHz Xeon     4 GB     2       7            3.5
node10     2.4 GHz Xeon     4 GB     2       6.96         3.48
t2wn61     E5345 2.33 GHz   16 GB    8       57.74        7.22
pplxwn16   E5420 2.5 GHz    16 GB    8       64.88        8.11
pplxint3   E5420 2.5 GHz    16 GB    8       64.71        8.09
These figures match closely with those published at http://www.infn.it/CCR/server/cpu2006-hepspec06-table.jpg. A Nehalem Dell server has just arrived for testing.
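The per-core column is just the total HEP-SPEC06 score divided by the core count; a quick Python check of the table values above:

    # Recompute HEP-SPEC06 per core from the totals in the table.
    results = {
        "node10":   (7.0,   2),
        "t2wn61":   (57.74, 8),
        "pplxwn16": (64.88, 8),
        "pplxint3": (64.71, 8),
    }
    for host, (total, cores) in results.items():
        print(host, round(total / cores, 2))   # 3.5, 7.22, 8.11, 8.09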
Cluster Usage at Oxford
Roughly equal share between LHCb and ATLAS for CPU hours. ATLAS runs many short jobs; LHCb runs longer jobs.

Cluster occupancy is approximately 70%, so there is still room for more jobs.

[Chart: local contribution to ATLAS MC storage]
Oxford recently had its network link rate capped to 100 Mb/s

This was a result of continuous 300-350 Mb/s traffic caused by CMS commissioning stress testing. As it happens, this test completed at the same time as we were capped, so we passed the test, and current normal use is not expected to be this high.

Oxford's JANET link is actually 2 x 1 Gbit links, which had become saturated. The short-term solution is to rate cap only JANET traffic, to 200 Mb/s, which doesn't impact normal working (for now); all other on-site traffic remains at 1 Gb/s. The long-term plan is to upgrade the JANET link to 10 Gb/s within the year.

[Graph: site network traffic, in and out]
gridppnagios
We have set up a Nagios monitoring site for the UK, which several other sites use to get advance warning of failures.

https://gridppnagios.physics.ox.ac.uk/nagios/
https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringInfo