Transcript: Mark Leese, STFC

Hosting Large-scale e-Infrastructure Resources
Mark Leese
[email protected]
Contents
• Speed dating introduction to STFC
• Idyllic life, pre-e-Infrastructure
• Sample STFC hosted e-Infrastructure projects
• RAL network re-design
• Other issues to consider
STFC
• One of seven publicly funded UK Research Councils
• Formed from 2007 merger of CCLRC and PPARC
• STFC does a lot, including…
  – awarding research, project & PhD grants
  – providing access to international science facilities through its funded membership of bodies like CERN
  – sharing its expertise in areas such as materials and space science with academic and industrial communities
• …but it is mainly recognised for hosting large-scale scientific facilities, inc. High Performance Computing (HPC) resources
Harwell Oxford Campus
- STFC major shareholder in Diamond Light Source: electron beam accelerated to near light speed within ring; resulting light (X-ray, UV or IR) interacts with samples being studied
- ISIS: ‘super-microscope’ employing neutron beams to study materials at atomic level
Harwell Oxford Campus
- STFC’s Rutherford Appleton Lab is part of Harwell Oxford Science and Innovation Campus with UKAEA and a commercial campus management company
- Co-locates hi-tech start-ups and multi-national organisations alongside established scientific and technical expertise
- Similar arrangement at Daresbury in Cheshire
- Both within George Osborne Enterprise Zones:
  - Reduced business rates
  - Government support for roll-out of superfast broadband
Previous Experiences
Large Hadron Collider
[Diagram: the 16.5-mile LHC ring and its four experiments: CMS, ALICE, LHCb and ATLAS]
• LHC at CERN
• Search for elementary but hypothetical Higgs boson particle
• Two proton (hadron) beams
• Four experiments (particle detectors)
• Detector electronics generate data during collisions
LHC and Tier-1
• After initial processing, the four experiments generated 13 PetaBytes of data in 2010 (> 15m GB or 3.3m single-layer DVDs)
• In the last 12 months, Tier-1 received ≈ 6 PBs from CERN and other Tier-1s (see the sketch below)
• GridPP contributes equivalent of 20,000 PCs
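To get a feel for those volumes, a back-of-envelope calculation of how long the year's inbound traffic occupies a lightpath. The 80% sustained-efficiency figure is an illustrative assumption, not from the talk:

```python
# Back-of-envelope: how long does a year's worth of inbound Tier-1
# traffic take on a 10 Gbps lightpath? (Volumes from the slide above.)

def transfer_days(bytes_total: float, link_bps: float, efficiency: float = 0.8) -> float:
    """Days needed to move bytes_total over a link of link_bps,
    assuming we only sustain `efficiency` of the raw line rate."""
    seconds = (bytes_total * 8) / (link_bps * efficiency)
    return seconds / 86_400

PB = 1e15                  # petabyte, decimal convention
inbound = 6 * PB           # ~6 PB received from CERN and other Tier-1s
print(f"{transfer_days(inbound, 10e9):.0f} days at 10 Gbps")   # ~69 days
```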
UK Tier-1 at RAL
[Diagram: four zones - ISP, front door, security, internal distribution. "Normal" data arrives over the primary and backup Janet links at the Site Access Router, passes the firewall, and reaches the RAL site via Router A. LHC data (PetaBytes?!?) arrives from CERN (Tier-0) and other Tier-1s over the LHC OPN (Optical Private Network) 10 Gbps lightpath plus backup, and from Tier-2s (universities) over Janet, entering via the UKLight Router straight to the Tier-1]
• Individual Tier-1 hosts route data to routers A or UKLight as appropriate; config pushed out with Quattor grid/cluster management tool (sketched below)
• Access Control Lists of IP addresses on SAR, UKLight router and/or hosts replace firewall security
• As Tier-2 (universities) network capabilities increase, so must RAL’s (10 → 20 → 30 Gbps)
LOFAR
- LOw Frequency Array
- World's largest and most sensitive radio telescope
- Thousands of simple dipole antennas, 38 European arrays
- 1st UK array opened at Chilbolton, Sept 2010
- 7 PetaBytes a year raw data generated (> 1.5m DVDs)
- Data transmitted in real-time to IBM BlueGene/P supercomputer at Uni of Groningen
- Data processed & combined in software to produce images of the radio sky
LOFAR
- 10 Gbps Janet Lightpath
- Janet → GÉANT → SURFnet
- Big leap from FedEx’ing data tapes or drives
- 2011 RCUK e-IAG: “Southampton and UCL make specific reference ... quicker to courier 1TB of data on a portable drive” (sanity-checked below)
- Funded by LOFAR-UK
- cf. LHC: centralised not distributed processing
- Expected to pioneer approach for other projects, e.g. Square Kilometre Array
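The courier comparison quoted above is easy to sanity-check. The link speeds and 80% efficiency below are illustrative assumptions:

```python
# At typical campus rates, couriering a 1 TB drive overnight beats the
# network; over a 10 Gbps lightpath it does not.

def hours(bytes_total, link_bps, efficiency=0.8):
    return bytes_total * 8 / (link_bps * efficiency) / 3600

TB = 1e12
for label, bps in [("100 Mbps campus link", 100e6), ("10 Gbps lightpath", 10e9)]:
    print(f"1 TB over {label}: {hours(TB, bps):.1f} h (courier: ~24 h)")
# -> ~27.8 h vs ~0.3 h
```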
Sample STFC e-Infrastructure Projects
ICE-CSE
• International Centre of Excellence for Computational Science and Engineering
• Was going to be Hartree Centre, now DFSC
• STFC Daresbury Laboratory, Cheshire
• Partnership with IBM
• Mission to provide HPC resources and develop software
• DL previously hosted HPCx, big academic HPC before HECToR
• IBM BlueGene/Q supercomputer
• 114,688 processor cores, 1.4 Petaflops peak performance (see the check below)
• Partner IBM’s tests were the first time a Petaflop application has been run in the UK (one thousand trillion calculations per second)
• 13th in this year’s TOP500 worldwide list
• Rest of Europe appears five times in Top 10
• DiRAC and HECToR (Edinburgh) 20th and 32nd
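The core count and peak figure above are consistent, as a quick check shows. The per-core rate (1.6 GHz × 8 flops per cycle for BlueGene/Q's 4-wide FMA units) is our assumption, not a figure from the slides:

```python
# Sanity check: cores x per-core peak rate should land near the quoted
# 1.4 Petaflops peak performance.

cores = 114_688
flops_per_core = 1.6e9 * 8           # assumed clock * flops per cycle
peak = cores * flops_per_core
print(f"peak ~= {peak / 1e15:.2f} Petaflops")   # ~1.47 PF
```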
ICE-CSE
• DL network upgraded to support up to 8 × 10 Gbps lightpaths to current regional Janet deliverer, Net North West, in Liverpool and Manchester
• Same optical fibres, different colours of light:
  1. 10G JANET IP service (primary)
  2. 10G JANET IP service (secondary)
  3. 10G DEISA (consortium of European supercomputers)
  4. 10G HECToR (Edinburgh)
  5. 10G ISIC (STFC-RAL)
  More expected as part of IBM-STFC collaboration
• Feasible because NNW rents its own dark (unlit) fibre network
• NNW ‘simply’ change the optical equipment on each end of the dark fibre
• Key aim is for machine and expertise to be available to commercial companies
• How? Over Janet?
• A Strategic Vision for UK e-Infrastructure estimates that 1,400 companies could make use of HPC, with 300 quite likely to do so
• So even if some instead go for the commercial “cloud” option...
JASMIN & CEMS
• Joint Analysis System Meeting Infrastructure Needs
• JASMIN and CEMS funded by BIS through NERC, and UKSA and ISIC respectively
• Compute and storage cluster for the climate and earth system modelling community
[Diagram: JASMIN is a big compute and storage cluster with 4.6 PetaBytes of fast disc storage. It talks internally to other STFC resources, to its satellite systems (150 TB compute + 500 TB, and 150 TB), and to the Netherlands, the Met Office & Edinburgh over UKLight]
CEMS in the ISIC
• Climate and Environmental Monitoring from Space
• Essentially JASMIN for commercial users
• Promote use of ‘space’ data and technology within new market sectors
• Four consortia already won funding from publicly funded ‘Space for Growth’ competition (run by UKSA, TSB and SEEDA)
• Hosted in International Space Innovation Centre
• A ‘not-for-profit’ formed by industrials, academia and government
• Part of UK’s Space Innovation and Growth Strategy to grow the sector’s £ turnover
• ISIC is STFC ‘Partner Organisation’ in terms of Janet Eligibility Policy
• So... Janet-BCE (Business and Community Engagement) for network access related to academic and ISIC partners
• Commercial ISP for network access related to commercial customers
• As the industrial collaboration agenda is pushed, this needs to be controlled and applicable elsewhere in STFC
[Diagram: Janet & Janet-BCE traffic arrives from Janet, commercial traffic from BT. Within the RAL infrastructure (no CEMS traffic permitted) a Janet-BCE VLAN and a commercial-customers VLAN run over 10 Gbps fibres, through routers and a switch in the ISIC, linking JASMIN and CEMS]
• JASMIN and CEMS connected at 10 Gbps…
• …but no Janet access for CEMS via JASMIN
• Keeping Janet ‘permitted’ traffic as separate BCE VLAN allows tighter control
• Customers will access CEMS on different IP addresses depending on who they are (academia, partners, commercials)
• This could be enforced, as sketched below
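Purely as a sketch of how that enforcement could work: classify each address against per-community prefixes using Python's stdlib ipaddress module. The prefixes are RFC 5737 documentation ranges standing in for real Janet/BCE/commercial allocations, not actual CEMS addressing:

```python
# Map an IP address to an access class, one prefix per community.
import ipaddress

ACCESS_CLASSES = {
    "academia (Janet)":     ipaddress.ip_network("192.0.2.0/25"),
    "partners (Janet-BCE)": ipaddress.ip_network("192.0.2.128/25"),
    "commercial (ISP)":     ipaddress.ip_network("198.51.100.0/24"),
}

def classify(client_ip: str) -> str:
    addr = ipaddress.ip_address(client_ip)
    for label, net in ACCESS_CLASSES.items():
        if addr in net:
            return label
    return "denied"          # anything outside the known ranges

print(classify("192.0.2.200"))   # -> partners (Janet-BCE)
```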
RAL Network Re-Design & Other Issues
RAL Network Re-Design
[Diagram: today's layout. From the outside world, "normal" data arrives over Janet at the RAL PoP and Site Access Router, passes the firewall and Router A for internal distribution to the RAL site (ISIS, JASMIN, admin). LHC data arrives from the CERN LHC OPN via the UKLight Router, direct to the Tier-1]
Two main aims:
1. Resilience: reduce serial paths and single points of failure.
2. Scalability and flexibility: remove need for special cases. Make adding bandwidth and adding ‘clouds’ (e.g. Tier-1 or tenants) a repeatable process with known costs.
[Diagram: the re-designed network in layers. External connectivity: primary and backup Janet, CERN LHC OPN, commercial ISP. Site access & distribution: RAL PoP routers 1 & 2 and campus access & distribution switches 1 & 2. Security: virtual firewall, with implicit trust relationships bypassing it. Internal distribution: internal site distribution Router A. These serve the rest of the RAL site, visitors, campus, projects/facilities/departments, tenants and the Tier-1]
Rtr 1 & 2, Sw 1 & 2:
• Front: 48 ports 1/10 GbE (SFP+)
• Back: 4 ports 40 GbE (QSFP+)
• Lots of 10 Gigs:
  – clouds and new providers can be readily added
  – bandwidth readily added to existing clouds
  – clouds can be dual connected
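Reading the front ports as access and the back ports as uplinks (our interpretation, not stated on the slide), the arithmetic behind "lots of 10 Gigs" works out like this:

```python
# Port capacity check: a 3:1 oversubscription only bites if every
# cloud peaks at once, which in practice they rarely do.

front = 48 * 10    # Gbps: 48 ports at 10 GbE
back  = 4 * 40     # Gbps: 4 uplink ports at 40 GbE
print(f"front {front} G : back {back} G = {front / back:.1f}:1 oversubscription")
```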
RAL Site Resilience
[Diagram: diversely routed external links, the primary towards Reading and the backup towards London; separation labels: 500 ft, 100m]
User Education
• The belief that you can plug a node or cluster into “the network” and be immediately firing lots of data all over the world is a fallacy
• Over-provisioning is not a complete solution
• Having invested £m’s elsewhere, most network problems that do arise are within the last mile: campus network → individual devices → applications
• On the end systems...
  – Network Interface Card
  – Hard disc
  – TCP configuration (see the sketch after this list)
  – Poor cabling
  – Does your application use parallel TCP streams?
  – What protocols does your application use for data transfer (GridFTP, HTTP...)?
• Know what to do on your end systems
• Know what questions to ask of others
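A minimal sketch of why "TCP configuration" matters on high-capacity paths: a single TCP stream cannot exceed window/RTT, so the socket buffer must cover the bandwidth-delay product. The RTT and link speeds below are illustrative, not measured values:

```python
# Bandwidth-delay product: buffer bytes needed to fill a long fat pipe.

def required_buffer_mb(link_bps: float, rtt_s: float) -> float:
    return link_bps * rtt_s / 8 / 1e6    # bits -> megabytes

rtt = 0.150                               # ~150 ms, roughly UK <-> US West coast
for bps in (100e6, 1e9, 10e9):
    print(f"{bps/1e9:g} Gbps needs >= {required_buffer_mb(bps, rtt):.1f} MB of TCP buffer")
# With default buffers of a few hundred KB, a 10 Gbps path delivers only
# tens of Mbps -- the same order as the 30 Mbps seen in the next slide.
```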
User Support
• 2010 example: CMIP5 - RAL Space sharing environmental data with Lawrence Livermore (West coast US) and DKRZ (Germany)
  – ESNet, California → GÉANT, London: 800 Mbps
  – ESNet, California → RAL Space: 30 Mbps
  – RAL Space → DKRZ, Germany: 40 Mbps
  – So RAL is the problem, right? Not necessarily...
  – DKRZ, Germany → RAL Space: up to 700 Mbps
• Involved six distinct parties: RAL Space, STFC Networking, Janet, DANTE, ESNet, LLNL
• Difficult, although the experiences probably fed into the aforementioned JASMIN
• Tildesley’s Strategic Vision for UK e-Infrastructure talks of “the additional effort to provide the skills and training needed for advice and guidance on matching end-systems to high-capacity networks”
I’ll do anything for a free lunch
• Access Control and Identity Management
  – During DTI’s e-Science programme access to resources was often controlled using personal X.509 certificates (see the sketch below)
  – Is that scalable?
  – Will you run or pay for a PKI?
  – Resource providers may want to try Moonshot
    • extension of eduroam technology
    • users of e-Infrastructure resources authenticated with user credentials held by their employer
• Will the Janet Brokerage be applicable to HPC e-Infrastructure resources?
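As a taste of the per-user tooling that personal certificates imply, a minimal expiry check. This assumes the third-party Python `cryptography` package, and the certificate path is hypothetical:

```python
# Check how long before a user's grid certificate needs renewing.
from datetime import datetime
from cryptography import x509

with open("user-grid-cert.pem", "rb") as f:    # hypothetical PEM file
    cert = x509.load_pem_x509_certificate(f.read())

days_left = (cert.not_valid_after - datetime.utcnow()).days
print(f"{cert.subject.rfc4514_string()}: {days_left} days until renewal")
```

Multiply that small chore by every user, every year, plus issuance and revocation, and the scalability question on the slide answers itself.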
Conclusions
From the STFC networking perspective:
• Adding bandwidth should be a repeatable process with known costs
• Networking is now a core utility, just like electricity: plan for resilience on many levels
• Plan for commercial interaction
• In all the excitement don’t forget security
• e-Infrastructure funding is paying for capital investments - be aware of the recurrent costs