HEPiX summary


HEPiX summary
GridPP10 meeting, CERN, 4 June 2004
David Kelsey
CCLRC/RAL, UK
[email protected]
Overview
• “Summary” of the recent
HEPiX Spring meeting in Edinburgh (24-28 May)
• Not a full summary!
– rather an overview and some topics for GridPP
• What is HEPiX?
• General comments and agenda details
• Some specific topics in a little more detail
– Linux plans (RedHat Enterprise Linux etc)
– Mass Storage Workshop
  • Agenda
  • LCG Service Challenges & WAN data movement
• Future HEPiX meetings
What is HEPiX?
• A global organisation (started in 1991)
– service managers and support staff providing computing
facilities for HEP
– Some participation by HEP User/Experiment community
• All operating systems used by HEP are now covered, including
UNIX, Windows and Grid computing
• For several years was called HEPiX-HEPNT
• Important for LCG Deployment coordination
• Two meetings per year (Europe and North America)
• Meetings allow participants to present recent work and future
plans and to share experiences with other attendees
• Open e-mail list
– [email protected]
– Join via usual Listserv request
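– For example, a standard LISTSERV subscribe request (the server address is an assumption, not given on the slide) is a plain-text mail:

  To:   [email protected]
  Body: SUBSCRIBE HEPIX-HEPNT Firstname Lastname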
HEPiX networking
• Social networking is an important part of the meeting…
General Comments
• Spring 2004 HEPiX meeting
– Hosted at the eScience Institute of the
UK National eScience Centre (NeSC) in Edinburgh,
Scotland
– Organised by RAL (DPK)
• ~100 people attended
http://hepwww.rl.ac.uk/hepix/nesc/participants.htm
– Record attendance!
• Agenda: http://hepwww.rl.ac.uk/hepix/nesc/agenda.htm
– all talks on web
– Video streaming archive coming soon
• Format
– 3 full days of HEPiX presentations and panels
– 1.5 days of Large Cluster SIG focussed meeting
• Mass Storage Workshop (organised by Olof Barring)
Agenda (1)
• Topics covered (HEPiX 3 days)
– Introduction to NeSC & 16 Site reports (most of first day)
– Two major sessions on Red Hat Linux plans
• The main topic of the week
– FNAL Scientific Linux, Next CERN Linux
– 4 Grid talks
• LCG Update, LCG Testing, LCG User Registration
• GridPP status
– Fabric: ELFms, Lemon
– Windows: Citrix & Remote access, SUS/SMS
– Disk/Tape storage service, Disk performance & high-speed
interconnects
– AFS Workshop, AFS Authenticated Remote control (file
space admin using SASL, GSSAPI etc)
Agenda (2)
• Other talks included:
– Bob Cowles’ traditional “Security Update”
• Both worrying and entertaining!
– Experiences with CDF hardware and service contract
– FNAL High Density Computing plans
– CERN Solaris issues
– NERSC PDSF developments
– FNAL YUM to improve Linux security
– CERN CVS
– IN2P3 BQS
– LAL Anti-SPAM
– CERN InDiCo
• Integrated Conference Management
RedHat Linux
Red Hat Linux
• Background
– LCG-2 (and others) currently based on RedHat 7.3
• 7.3 has reached end of life
• Support and security patches no longer available
– RedHat Enterprise Linux V3 available since Oct 03
• But have to purchase support via subscription to install binaries and obtain updates
– Initial discussions with RedHat at Oct 2003 HEPiX
• Can we negotiate an HEP-wide solution?
• Experiments validate specific distributions
– Many sites have to support multiple experiments
– Great desire (by mid-size sites) to standardise
• But sites need to customise the environment
Presentation by Nathan Jones
(RedHat Inc)
Red Hat (2)
Red Hat (3)
Red Hat (4)
SLAC Linux Plans
Chuck Boeheim
(SLAC was the first site to negotiate a deal
with RedHat)
What SLAC is Getting
• Entitlements to cover farms, servers, desktops, remote collaborators
• Technical Account Manager
• Weekly Service Meeting
• Advocacy with developers for items of future interest: Infiniband, AFS support in 2.6
• Deployment and troubleshooting help
• Annual cost < 1 FTE
Other Factors
• BaBar wanted consistent platform for collaborators
• Prompt response to security alerts
• Control rate of change (maybe)
• Recognition of value we get from Red Hat
• HEP consensus is important
Fermilab Linux Plans
Mark O. Kaletka
HEPiX Spring 2004, Edinburgh
Fermilab Linux Plans
• “Scientific Linux” is:
– See Connie Sieh’s excellent talk for technical details
• http://hepwww.rl.ac.uk/hepix/nesc/sieh1.pdf
– A natural outgrowth of Fermi Linux
• And an abstraction layer on top of Fermi Linux
– “Workgroup” > “Site”
– Offered to the HEP community as an alternative
• Much of this is work we would do anyway, others should benefit
• But, this may not be right for everyone
– In the spirit of Open Source
• Conforms to GPL license & Red Hat’s trademark guidelines
– An opportunity to work together
• Intended to be a community support effort
– But this will require some formal infrastructure & coordination
• Not an offer by Fermilab to support everyone!
Connie Sieh’s talk
CERN – Jan Iven
Linux (DESY)
• DL5 (SuSE 8.2) rollout in progress (25% done)
– support for base distribution ends April 2004
– 9.0 patches will help for another 6 months
– successor (better: a continuation) needed early next year
• DL5 is most likely the last DESY Linux based on SuSE
– if a common HEP distribution with long lifetime is available and affordable, that's what we'll use
• started looking at Scientific Linux
– thanks to Fermilab for providing this!
– current version seems very compatible with DL5 (for users)
– purchase of licenses is an option, if the price/value ratio is ok
RH Linux (Edinburgh)
• Red Hat offer for worldwide HEP RHEL WS subscription is now
available
– SLAC and CERN evaluating the value
– Many sites will wait for feedback
– Still concerns about lack of a site “licence”
• Or payment per support request
• Discussions during HEPiX meeting resulted in a draft joint
FNAL/CERN proposal to LCG/EGEE
– “The Edinburgh Accord”
– http://cern.ch/hepix/linuxProposal.txt
– Standardise on Scientific Linux as base Linux platform
– Separate site customisation from the common core
– Common core packages binary compatible with RHEL
– With care, experiments can validate on one and run on any
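To illustrate the proposed split (a sketch only, not part of the proposal text): a site could point yum at the common Scientific Linux core repository plus a local overlay holding its own customisation. Repository names and URLs below are hypothetical:

  # /etc/yum.conf - illustrative sketch; names and URLs are hypothetical
  [main]
  cachedir=/var/cache/yum
  logfile=/var/log/yum.log

  # Common HEP core: Scientific Linux, binary compatible with RHEL
  [sl-core]
  name=Scientific Linux - common core
  baseurl=http://linux.example.org/scientific/30x/i386/

  # Site customisation, layered on top of the common core
  [site-overlay]
  name=Local site packages
  baseurl=http://install.example.org/overlay/i386/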
Mass Storage Workshop
• 1.5 days – Meeting of HEPiX Large Cluster SIG
• Agenda included
– Two vendor talks (IBM and STK)
– Results from Caspur Storage Lab
• Results of many tests - lots of info
– Overview of SRB
– EDG WP5/GridPP experiences with SRM/SRB/SE
and multiple interfaces (Jensen)
– GFAL and LCG data management
– Mass Storage and WAN data movement
– Storage systems and high performance networks
– Integrating dCache at FZK
– Castor SRM v1.1
IBM TotalStorage SAN File System
• Overview
• SAN FS for Grid & HPC
Paul L. Bradshaw, IBM Almaden Research Center
"Breakthrough to On Demand with IBM TotalStorage"
(The remaining IBM slides were architecture diagrams; the recoverable content:)
• "IBM TotalStorage Open Software Family - taking steps toward an On Demand storage environment": storage orchestration; hierarchical, archive and recovery management (for files, databases, mail, SAP, application servers); storage virtualization via the SAN Volume Controller, evolving to an on demand environment with the SAN File System
• "Current IT Environment": multiple, different file systems (FS) across servers and multiple, different advanced functions (AF) across storage devices (IBM, HP, HDS, EMC), each with individual interfaces - increasing complexity of deployment, access, and management of the IT infrastructure
• "Consolidate Server Infrastructure": consolidate servers to zSeries, pSeries, xSeries and BladeCenter, distribute workloads to the most appropriate platforms, and consolidate storage into SAN file systems
WAN RAW/ESD Data Distribution for LHC
Bernd Panzer-Steindel
Proposed timescales and scheduling
(timeline chart, mid-2004 to end-2006; milestones shown:)
• 10Gbit "end-to-end" tests with Fermilab
• First version of the LHC Community Network proposal
• 10Gbit "end-to-end" test complete with European partner
• Measure performance variability and understand H/W and S/W issues to ALL sites
• Document circuit-switched options and costs; first real test if possible
• Circuit/packet switch design completed
• LHC Community Network proposal completed
• All T1 fabric architecture documents completed
• LCG TDR completed
• Sustained throughput test achieved to some sites: 2-4 Gb/sec for 2 months; H/W and S/W problems solved
• All CERN b/w provisioned; all T1 bandwidth in production (10Gb links)
• Sustained throughput tests achieved to most sites
• Verified performance to all sites for at least 2 months
Bernd Panzer-Steindel
Data Management Service Challenge (LCG)
Scope
• Networking, file transfer, data management
• Storage management - interoperability
• Fully functional storage element (SE)
Layered Services
• Network
• Robust file transfer
• Storage interfaces and functionality
• Replica location service
• Data management tools
les robertson - cern-it
Short Term Targets (LCG)
• Now (or next week) –
1. Participating sites with contact names
2. Agreed ramp-up plan, with milestones – 2-year horizon
• End June –
• Targets for end 2004 –
1. SRM-SRM (disk) on 10 Gbps links between CERN, Triumf, FZK, FNAL, NIKHEF/SARA → 500 MB/sec (?) sustained for days
2. Reliable data transfer service
3. Mass storage system <-> mass storage system
   – SRM v.1 at all sites
   – disk-disk, disk-tape, tape-tape
4. Permanent service in operation
   – sustained load (mixed user and generated workload)
   – > 10 sites
   – key target is reliability
   – load level targets to be set
les robertson - cern-it
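For scale (an illustrative calculation, not from the slide): 500 MB/sec is 4 Gbit/s, i.e. 40% of a 10 Gbps link, and moves roughly 43 TB per day:

  # Back-of-envelope check of the end-2004 target (illustrative)
  rate_MB_s = 500
  print(f"{rate_MB_s * 8 / 1000:.0f} Gbit/s")                   # 4 Gbit/s
  print(f"{rate_MB_s * 8 / 10000:.0%} of a 10 Gbps link")       # 40%
  print(f"{rate_MB_s * 86400 / 1e6:.1f} TB/day sustained")      # ~43.2 TB/day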
Exporting Raw/ESD data from Tier-0 → Tier-1s
Wrap-up
Olof Barring
The problem (Bernd)
• One copy of the LHC raw data for each of the
LHC experiments is shared among the Tier-1’s
• Full copies of the ESD data (1/2 of raw data size)
• Total ~10PB/year exported from CERN
• The full machinery for doing this automatically
should be in place for full-scale tests in 2006
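As a rough cross-check (illustrative, not from the talk): ~10 PB/year corresponds to an average sustained export rate of roughly 320 MB/sec, about 2.5 Gbit/s, consistent with the 2-4 Gb/sec sustained-throughput milestones above:

  # Back-of-envelope check (illustrative): average rate for ~10 PB/year
  PB = 1e15                              # bytes per petabyte (decimal)
  SECONDS_PER_YEAR = 365.25 * 24 * 3600

  rate = 10 * PB / SECONDS_PER_YEAR      # bytes per second
  print(f"{rate / 1e6:.0f} MB/sec average")      # ~317 MB/sec
  print(f"{rate * 8 / 1e9:.1f} Gbit/s average")  # ~2.5 Gbit/s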
Tier-1 resources
TRIUMF
• 2 machines purchased, 1Gbit(?)
RAL
• Gbit link at present
• Parallel activities from ATLAS and CMS
• Not enough effort to dedicate for the moment
• More hardware in September
FNAL
• Just finished CMS DC – very labor intensive
• Enough resources to sustain 2TB/day
GridKA
• 1Gbit at present, expanding to 10Gbit in October/November
• Storage system is ready (dCache + TSM)
BNL
• SRM service almost ready (in a month)
• One gridftp node
• OC12 connection, not much used
NIKHEF/SARA
• 10Gbit for more than a year
• Running data challenges for experiments, but mainly CPU intensive
IN2P3/Lyon
• Not yet ready with interface to MSS
• 1Gbit
Agreed tests
1. Simple disk-to-disk, peer-to-peer
2. Simple disk-to-disk, one-to-many
3. MSS-to-MSS
4. In parallel?
   a) Transfer scheduling
   b) Replica catalogue & management
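A transfer in the disk-to-disk tests above would typically run over GridFTP (BNL, for example, lists a gridftp node). A minimal sketch using globus-url-copy; hostnames and paths are illustrative:

  # Peer-to-peer disk-to-disk copy between two GridFTP servers,
  # using 8 parallel TCP streams (endpoints are illustrative)
  globus-url-copy -p 8 \
      gsiftp://se.tier0.example.org/data/run0001.raw \
      gsiftp://se.tier1.example.org/data/run0001.raw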
Timescales
(timeline chart covering June 2004 to January 2005 for TRIUMF, RAL, FNAL and BNL, with milestones for one-to-many? tests, transfer scheduling, and SRM-Basic ready(?))
Future HEPiX meetings
• Fall 2004 meeting (USA)
– At BNL, 18-22 October 2004
– Large Cluster SIG (1 day) proposed to cover
“Platform Technology” (h/w and s/w)
• is there a role for MacOS?
• why Itanium?
• AMD or Intel?
• 32 or 64 bit?
• etc…
• Spring 2005 meeting (Europe)
– To be hosted by FZK (Karlsruhe)
• Date still to be fixed (2nd or 4th week of May 2005)