STFC DiRAC Facility


DiRAC@GRIDPP
• www.dirac.ac.uk – please look at our wiki!
• Part of the new national e-Infrastructure: http://www.bis.gov.uk/policies/science/science-funding/elc
• Provides advanced IT services for theoretical and simulation research in Particle Physics, Particle Astrophysics, Cosmology, Astrophysics and Solar System Science [90%]
• And for Industry, Commerce and the Public Sector [10%]
• Given the mission of investigating and deploying cutting-edge technologies (hardware and software) in a production environment
• Erk! We are a playpen as well as a research facility
Community Run
• DiRAC uses the standard Research Council Facility Model.
• Academic community oversight and supervision of technical staff
• Regular reporting to the Project Management Board. This includes science outcomes
• Outcome-driven resource allocation – no research outcomes, no research
• Sound familiar?
Structure
• Standard Facility Model
– Oversight Committee (meets twice yearly); Chair: Foster (CERN)
– Project Management Board (meets monthly); Chair: Davies (Glasgow)
– Technical Working Group (meets fortnightly); Chair: Boyle (Edinburgh)
– Project Director: Yates (UCL)
• The PMB sets strategy and policy and considers reports on equipment usage and research outcomes – 10 members
• TWG members deliver HPC services and undertake projects for the Facility (8 members)
Computational Methods
• Monte Carlo – particle-based codes for CFD and the solution of PDEs
• Matrix methods – Navier-Stokes and Schrödinger equations
• Integrators – Runge-Kutta methods for ODEs (see the sketch after this list)
• Numerical lattice QCD calculations using Monte Carlo methods
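To make the integrator bullet concrete, here is a minimal sketch of the classic fourth-order Runge-Kutta (RK4) step for an ODE system, applied to a simple harmonic oscillator. The function names and the example ODE are illustrative choices, not taken from any DiRAC code.

```python
import numpy as np

def rk4_step(f, t, y, h):
    """Advance the ODE y' = f(t, y) by one step of size h using classic RK4."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h * k1 / 2)
    k3 = f(t + h / 2, y + h * k2 / 2)
    k4 = f(t + h, y + h * k3)
    return y + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)

# Illustrative ODE: simple harmonic oscillator, y = (position, velocity).
def oscillator(t, y):
    return np.array([y[1], -y[0]])

t, y, h = 0.0, np.array([1.0, 0.0]), 0.01
for _ in range(1000):              # integrate out to t = 10
    y = rk4_step(oscillator, t, y, h)
    t += h
print(t, y)                        # position should be close to cos(10)
```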
Where are we going – DiRAC-2
• 22 September 2011: DiRAC invited to submit a one-page proposal to BIS to upgrade systems and make them part of the baseline service for the new National e-Infrastructure. Assisted by STFC (Trish, Malcolm and Janet)
• Awarded £14M for Compute and £1M for Storage. Strict payment deadlines (31/03/2012) imposed.
• Reviewed by the Cabinet Office under the Gateway Review mechanism
• Payment profiles agreed with BIS
How has it gone?
• Owning kit and paying for admin support at our sites works best – a hosting solution delivers by far the most.
– Rapid deployment of 5 systems using local HEI procurement
• Buying access via SLA/MoU is simply not as good – we are just another user. SLAs don't always exist!
• Excellent research outcomes – papers are flowing from the service
• Had to consolidate from 14 systems to 5 systems
New Systems III
• Final system specs below. Costs include some data centre capital works.

System (supplier)         Tflop/s                Connectivity     RAM     PFS          Cost /£M
BG/Q (IBM)                540 (total now 1300)   5D Torus         16TB    1PB          6.0
SMP (SGI)                 42                     NUMA             16TB    200TB        1.8
Data Centric (OSF/IBM)    135                    QDR IB           56TB    2PB usable   3.7
Data Analytic (DELL)      50% of 200             FDR (Mellanox)   38TB    2PB usable   1.5
Complexity (HP)           90                     FDR (Mellanox)   36TB    0.8PB        2.0
User Access to Resources
• Now have an independent peer-review system
• People apply for time, just like experimentalists do!
• First call – 21 proposals! Oversubscribed
• Central budget from STFC for power and support staff (4 FTE)
• Will need to leverage users' local sysadmin support to assist with DiRAC
• We do need a cadre of developers to parallelise and optimise existing codes and to develop new applications
• Working with vendors and industry to attract more funds.
• Created a common login and reporting environment for DiRAC using the EPCC-DL SAFE system – crude identity management (a usage-reporting sketch follows below)
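To illustrate the kind of reporting such a common environment supports, here is a minimal sketch that rolls accounting records up into per-project CPU-hour totals. The CSV layout and column names are illustrative assumptions only; the actual service uses the EPCC-DL SAFE system, whose schema is not shown here.

```python
import csv
from collections import defaultdict

def cpu_hours_by_project(accounting_csv):
    """Sum CPU-hours per project from an accounting dump.

    Assumed columns (illustrative, not the SAFE schema):
    project, user, cores, wallclock_hours
    """
    totals = defaultdict(float)
    with open(accounting_csv, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["project"]] += int(row["cores"]) * float(row["wallclock_hours"])
    return dict(totals)

if __name__ == "__main__":
    for project, hours in sorted(cpu_hours_by_project("usage.csv").items()):
        print(f"{project}: {hours:.0f} CPU-hours")
```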
TWG Projects
• In progress: authentication/access for all our users to their allowed resources. Currently an SSH-key and database-updating kludge (a sketch of the idea follows this list).
• In progress: usage and system monitoring – using SAFE initially
• Network testing in concert with Janet
• GPFS multi-clustering – multi-clustering lets compute servers outside the cluster that serves a file system remotely mount that file system.
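To make the "SSH and database-updating kludge" concrete, here is a minimal hypothetical sketch of the idea: pull each user's registered public keys from a central database and rewrite the per-user authorized_keys files on a site's login node. The database path, table schema and home-directory layout are illustrative assumptions, not the actual DiRAC/SAFE implementation.

```python
import sqlite3
from pathlib import Path

DB_PATH = "users.db"          # hypothetical central registration database
HOME_ROOT = Path("/home")     # illustrative home-directory layout

def sync_authorized_keys(db_path: str = DB_PATH) -> None:
    """Rewrite each user's ~/.ssh/authorized_keys from the central key database."""
    conn = sqlite3.connect(db_path)
    # Assumed schema (illustrative): ssh_keys(username TEXT, pubkey TEXT)
    rows = conn.execute("SELECT username, pubkey FROM ssh_keys ORDER BY username")
    keys_by_user = {}
    for username, pubkey in rows:
        keys_by_user.setdefault(username, []).append(pubkey.strip())
    conn.close()

    for username, keys in keys_by_user.items():
        ssh_dir = HOME_ROOT / username / ".ssh"
        ssh_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
        auth_file = ssh_dir / "authorized_keys"
        auth_file.write_text("\n".join(keys) + "\n")
        auth_file.chmod(0o600)

if __name__ == "__main__":
    sync_authorized_keys()
```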
GPFS Multi-Clustering
• Why might we want it? You can use ls, cp, mv etc. directly – much more intuitive for ordinary users, and no FTP-ing involved.
• Does it work over long distances (WAN)? Perhaps – that is what we want to find out.
• Offer from IBM to test between Edinburgh and Durham, both DiRAC GPFS sites. Would like to test simple data-replication workflows.
• Understand and quantify identity and UID mapping issues. How bad are they? Can SAFE help sort these out, or do we need something else? (A toy consistency check is sketched after this list.)
• Extend to the Hartree Centre, another GPFS site. Perform more complex workflows.
• Extend to Cambridge and Leicester – non-IBM sites. IBM want to solve interoperability issues.
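As one way to quantify the UID mapping problem, here is a minimal hypothetical sketch: compare the username-to-UID maps exported by each site and report users whose UIDs disagree. The per-site passwd-style dumps and their file names are assumptions for illustration; they are not part of SAFE or GPFS.

```python
from collections import defaultdict

def load_passwd_map(path):
    """Parse a passwd-style dump (name:x:uid:...) into {username: uid}."""
    mapping = {}
    with open(path) as f:
        for line in f:
            fields = line.strip().split(":")
            if len(fields) >= 3:
                mapping[fields[0]] = int(fields[2])
    return mapping

def uid_mismatches(site_files):
    """Return {username: {site: uid}} for users whose UID differs across sites."""
    seen = defaultdict(dict)
    for site, path in site_files.items():
        for user, uid in load_passwd_map(path).items():
            seen[user][site] = uid
    return {user: uids for user, uids in seen.items()
            if len(set(uids.values())) > 1}

if __name__ == "__main__":
    # Hypothetical passwd dumps collected from two DiRAC GPFS sites.
    sites = {"edinburgh": "passwd.edinburgh", "durham": "passwd.durham"}
    for user, uids in uid_mismatches(sites).items():
        print(f"{user}: {uids}")
```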
The Near Future
• New management structure
– Project Director (me) in place and Project Manager now funded at the 0.5 FTE level. 1 FTE of system support at each of the four hosting sites
• Need to sort out the sustainability of the Facility
– Tied to the establishment of the National e-Infrastructure Leadership Council's 10-year capital and recurrent investment programme (~April 2014?)
• We can now perform research-domain leadership simulations in the UK
• Need to federate access/authentication/monitoring middleware that actually works, is easily usable AND is easy to manage.
• Need to federate storage to allow easier workflows and data security
Some Theoretical Physics – The Universe (well, 12.5% of it)