LCG Deployment
Ian Bird
IT Department, CERN
10th GridPP Meeting
CERN
4th June 2004
10th GridPP Meeting – 4 June 2004 - 1
Overview
Deployment area organisation
Some history – where we are now
Data challenges – experiences
Evolution – service challenges
Transition to EGEE
Interoperability
Summary
LCG Deployment Organisation and Collaborations
[Organisation chart; only the labels survive extraction.
Boxes: LHC Experiments; Regional Centres; Grid Deployment Board; GDB task forces; Security group; Storage group; Testing group; Operations Centres - RAL; Call Centres - FZK; JTB; HEPiX; GGF; Grid Projects: EDG, Trillium, Grid3/OSG, etc.
LCG Deployment Area: Deployment Area Manager; Certification Team; Deployment Team; Experiment Integration Team.
Arrow labels: Set requirements; Advises, informs, sets policy; participate; Collaborative activities.]
Communication
Weekly GDA meetings (Monday 14:00, VRVS, phone)
Mail-list – [email protected]
Open to all – need experiments, regional centres, etc.
Technical discussions, understand what priorities are
Policy issues referred back to PEB or GDB
Experience so far:
• Experiments join, regional centres don’t
• NEED participation of system managers and admins – we need a rounded view of the issues
Weekly core site phone conference
Address specific issues with deployment
Also at CERN:
Weekly DC coordination meetings with each experiment
GDB meetings monthly
Make sure your GDB rep keeps you informed
Open to ways to improve communication!
Some history – 2003/2004
Recall goals:
July: Introduce the initial publicly available LCG-1 global grid service
November: Expanded LCG-1 service with resources and functionality
sufficient for the 2004 Computing Data Challenges
LCG-0 : pilot service was deployed in Feb/March
Was used by CMS in Italy very successfully for productions
LCG-1 : based on VDT & EDG 2.0 was deployed in September
Not heavily used by experiments – but was successfully used by CMS for
production over Christmas and (US-)Atlas demonstrated interoperability
with Grid2003
Lacked a real (managed) SE and integration with MSS
LCG-2 : based on VDT & EDG 2.1 was ready by end 2003
Data management tools integrated with SRM, intended to package dCache
as managed disk-SE.
Deployed in Jan/Feb 2004 – many updates – used by experiments in 2004
data challenges
Sites in LCG-2/EGEE-0 : June 4 2004
Austria: U-Innsbruck
Canada: Triumf, Alberta, Carleton, Montreal, Toronto
Czech Republic: Prague-FZU, Prague-CESNET
France: CC-IN2P3, Clermont-Ferrand
Germany: FZK, Aachen, DESY, Wuppertal
Greece: HellasGrid
Hungary: Budapest
India: TIFR
Israel: Tel-Aviv, Weizmann
Italy: CNAF, Frascati, Legnaro, Milano, Napoli, Roma, Torino
Japan: Tokyo
Netherlands: NIKHEF
Pakistan: NCP
Poland: Krakow
Portugal: LIP
Russia: SINP-Moscow, JINR-Dubna
Spain: PIC, UAM, USC, UB-Barcelona, IFCA, CIEMAT, IFIC
Switzerland: CERN, CSCS
Taiwan: ASCC, NCU
UK: RAL, Birmingham, Cavendish, Glasgow, Imperial, Lancaster, Manchester, QMUL, RAL-PP, Sheffield, UCL
US: BNL, FNAL
HP: Puerto-Rico
• 22 Countries
• 58 Sites (45 Europe, 2 US, 5 Canada, 5 Asia, 1 HP)
• Coming: New Zealand, China, other HP (Brazil, Singapore)
• 3800 CPUs
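As a quick consistency check, the site lists on this slide can be tallied against the stated totals. The sketch below transcribes the lists into a Python dictionary. Counting Israel with Europe is an assumption made here, because that grouping is what reproduces the slide's figure of 45 European sites.

```python
# Site lists transcribed from the slide. The regional grouping below is an
# assumption (Israel is counted with Europe to match the slide's totals).
sites = {
    "Austria": ["U-Innsbruck"],
    "Canada": ["Triumf", "Alberta", "Carleton", "Montreal", "Toronto"],
    "Czech Republic": ["Prague-FZU", "Prague-CESNET"],
    "France": ["CC-IN2P3", "Clermont-Ferrand"],
    "Germany": ["FZK", "Aachen", "DESY", "Wuppertal"],
    "Greece": ["HellasGrid"],
    "Hungary": ["Budapest"],
    "India": ["TIFR"],
    "Israel": ["Tel-Aviv", "Weizmann"],
    "Italy": ["CNAF", "Frascati", "Legnaro", "Milano", "Napoli", "Roma", "Torino"],
    "Japan": ["Tokyo"],
    "Netherlands": ["NIKHEF"],
    "Pakistan": ["NCP"],
    "Poland": ["Krakow"],
    "Portugal": ["LIP"],
    "Russia": ["SINP-Moscow", "JINR-Dubna"],
    "Spain": ["PIC", "UAM", "USC", "UB-Barcelona", "IFCA", "CIEMAT", "IFIC"],
    "Switzerland": ["CERN", "CSCS"],
    "Taiwan": ["ASCC", "NCU"],
    "UK": ["RAL", "Birmingham", "Cavendish", "Glasgow", "Imperial", "Lancaster",
           "Manchester", "QMUL", "RAL-PP", "Sheffield", "UCL"],
    "US": ["BNL", "FNAL"],
    "HP": ["Puerto-Rico"],
}

asia = {"Japan", "Pakistan", "India", "Taiwan"}
non_european_other = {"Canada", "US", "HP"}
europe = set(sites) - asia - non_european_other


def count(countries):
    """Number of sites across the given country entries."""
    return sum(len(sites[c]) for c in countries)


total = count(sites)
print(len(sites), total, count(europe), count(asia))  # 22 58 45 5
```

The 22 entries include HP as its own entry, which is how the slide's "22 Countries" figure is reached.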
Experience: Data challenges
Alice has been running since March
CMS DC04
LHCb now starting seriously
Atlas starting now
See talks from June 2
Data challenges – so far
Resources
CPU available – Alice could not fully utilise – storage limitations
Disk available – mostly very small amounts
• Need:
– Plan space vs cpu at a site
– Ensure that commitments are provided
• To some extent disk was not requested – the dCache SE was delayed, and sites were asked not to commit all disk to classic SEs because a migration was expected and was a concern
Alice and CMS –
• Number and size (small) of files:
– limitations of existing Castor system, also problems in Enstore/dCache
CPU is mostly in core sites
At the moment (most of) the other sites have relatively few CPUs assigned
Data challenges – 2
Services:
LCG-2 services (RB, BDII, CE, SE etc) have been extremely reliable
and stable
• Even RLS was stable (other issues)
BDII has been extremely reliable
• Provided to experiments – allowed them to define a view of the system
Software deployment system works
Needs some improvement – esp for sites with no shared filesystem
Information system
Schema does not match batch system functionality
Information published (job slots, ETT, etc.) does not reflect batch
system
Solve with CE per VO, need to improve/adapt schema (?)
RLS issues
RLS performance was biggest problem
Many fixes made during challenge:
CLI tools based on C++ API in place of Java tools
Added support for non-SE entries
Additional tools (register with existing guid)
Case sensitivity
Performance analysis – usage of metadata queries
Lack of bulk operations
No support for transactions
Still unresolved service performance issue (see degradation) – seems to be server related
No data loss or extended service downtime
Replication tests with CNAF
Not really tested by CMS
RLS – cont.
Many of the above issues are addressed in the version currently being tested
Preparing a note describing proposed improvements for discussion: e.g.
Combine RMC and LRC into a single db to allow the db to optimise and join
Resolve issues found in data challenges
Model for replicated/distributed catalogs?
Is the model of metadata appropriate?
Experiment vs POOL vs RLS
With DB group continue to investigate Oracle replication
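The proposal to combine the RMC and LRC into a single database can be illustrated with a small sketch. It uses SQLite from the Python standard library purely for illustration. The two-table schema, the column names, and the sample entries are assumptions, not the actual RLS schema. The simplified picture assumed here is that the RMC maps logical names and metadata to GUIDs and the LRC maps GUIDs to physical replicas.

```python
import sqlite3

# In-memory database standing in for a single combined catalog (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rmc (lfn TEXT PRIMARY KEY, guid TEXT NOT NULL, owner TEXT);
CREATE TABLE lrc (guid TEXT NOT NULL, pfn TEXT NOT NULL);
CREATE INDEX lrc_guid ON lrc (guid);
""")

# Bulk registration inside one transaction: either all rows land or none do.
with conn:
    conn.executemany("INSERT INTO rmc VALUES (?, ?, ?)", [
        ("lfn:/cms/dc04/file1", "guid-0001", "cms"),
        ("lfn:/cms/dc04/file2", "guid-0002", "cms"),
        ("lfn:/alice/run1", "guid-0003", "alice"),
    ])
    conn.executemany("INSERT INTO lrc VALUES (?, ?)", [
        ("guid-0001", "srm://se.example.org/f1"),
        ("guid-0001", "srm://se2.example.org/f1"),
        ("guid-0002", "srm://se.example.org/f2"),
        ("guid-0003", "srm://se.example.org/a1"),
    ])


def replicas_for_owner(owner):
    """Metadata query and replica lookup resolved by one join in the database."""
    rows = conn.execute(
        "SELECT rmc.lfn, lrc.pfn FROM rmc JOIN lrc ON lrc.guid = rmc.guid "
        "WHERE rmc.owner = ? ORDER BY rmc.lfn, lrc.pfn",
        (owner,),
    )
    return rows.fetchall()


for lfn, pfn in replicas_for_owner("cms"):
    print(lfn, pfn)
```

With separate catalogs, a client would have to run the metadata query against one service and then look up each GUID in the other. With both tables in one database, the query planner can do that work in a single statement.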
Evolution: missing features
A full storage element
dCache has had many problems
Nice features of SRM gave users a lot of convenience:
- auto directory creation;
We were able to continue improving our setup during DC04:
- The biggest performance gain was: Michael and his team at DESY developed a new module that reduces the delegated proxy's modulus size in SRM and speeds up the interaction between SRM client and server 3.5 times;
(From CMS FNAL team, based on work done by deployment group)
Packaged - to be deployed
Is dCache sufficient/the only solution?
Demonstrated integration of Tier 1 MSS’s
Full data management tools
Evolution: missing features
Functionally:
Port to other RH-derived Linux distributions
• This is now becoming urgent – new hardware, security patches, …
VOMS
• At least the basic part
R-GMA
• For monitoring
Replace OpenPBS as default batch system
Operationally:
Assumption of real operational management by GOCs
• A lot of work on basics has been done – but need problem management
User call centre
• Lack of take-up
• Propose FZK/GOC team come to CERN for 1-2 days to really sort this out
Accounting:
• Critical – we have no information about what has been used during the DCs – important for us and for the experiments
Monitoring:
• Grid: lack of consistency in what is presented for each site
• Experiments: we must put R-GMA in place (at least)
Evolution: Service Challenges
Purpose
Understand what it takes to operate a real grid service – run for days/weeks
at a time (outside of experiment Data Challenges)
Trigger/encourage the Tier1 planning – move towards real resource
planning for phase 2 – based on realistic usage patterns
• How does a Tier 1 decide what capacity to provide?
• What planning is needed to achieve that?
• Where are we in this process?
Get the essential grid services ramped up to needed levels – and
demonstrate that they work
Set out milestones needed to achieve goals during the service challenges
NB: This is focussed on Tier 0 – Tier 1/large Tier 2
Data management, batch production and analysis
By end 2004 – have in place a robust and reliable data management
service and support infrastructure and robust batch job submission
Service challenges – examples
Data Management
Networking, file transfer, data management
Storage management and interoperability
Fully functional storage element (SE)
Continuous job probes
Understand limits
Operations centres
Accounting, assume levels of service responsibility, etc
Hand-off of responsibility (RAL-Taipei-US/Canada)
"Security incident"
Detection, incident response, dissemination and resolution
IP connectivity
Milestones to remove the (implementation) need for outbound connections from WNs
User support
Assumption of responsibility, demonstrate staff in place, etc
VO management
Robust and flexible registration, management interfaces, etc
Etc.
Data Management – example
Data management builds on a stack of underlying services:
Network
Robust file transfer
Storage interfaces and functionality
Replica location service
Data management tools
Data management – 2
Network layer:
Proposed set of network milestones already in draft
• Network and fabric groups at CERN – collaborate with (initially) “official” Tier 1s
• Dedicated private networks for Tier 0 to Tier 1 “online” raw data transfers
File transfer service layer:
Move a file from A to B, with good performance and reliability
This service would normally only be visible via the data movement service
• Only app that can access/schedule/control this network
Examples of this are GridFTP, bbFTP, etc.
Reliability – the service must detect failure, retry, etc.
Interfaces to storage systems (SRM)
The US-CMS/CERN “Edge Computing” project might be an instance of
this layer (network + file transfer)
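The reliability requirement on the file transfer layer (detect failure, retry) can be sketched as a small wrapper. This is an illustration under stated assumptions, not how any existing tool behaves. `copy` is a hypothetical callable standing in for a real transfer command, `TransferError` is a made-up exception type, and the exponential back-off policy is one possible choice.

```python
import time


class TransferError(Exception):
    """Raised when a single copy attempt fails (hypothetical error type)."""


def transfer_with_retry(copy, source, dest, attempts=3, base_delay=1.0,
                        sleep=time.sleep):
    """Call copy(source, dest) until it succeeds or attempts run out.

    Returns the number of attempts used. `copy` must raise TransferError
    when an attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            copy(source, dest)
            return attempt
        except TransferError as err:
            last_error = err
            if attempt < attempts:
                # Exponential back-off: base_delay, then 2x, then 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    raise TransferError(f"gave up after {attempts} attempts: {last_error}")


def make_flaky_copy(failures):
    """Build a fake copy function that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def copy(source, dest):
        if state["left"] > 0:
            state["left"] -= 1
            raise TransferError(f"simulated failure copying {source}")

    return copy


# Two simulated failures, so the third attempt succeeds.
used = transfer_with_retry(make_flaky_copy(2), "site-a:/data/f1",
                           "site-b:/data/f1", base_delay=0.0)
print(used)  # 3
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting. A production service would also need to verify the copied file, which this sketch leaves out.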
Data management – 3
Data movement service layer:
Builds on top of file transfer and network layers
To provide an absolutely reliable and dependable service with good
performance
Implement queuing, priorities, etc.
Initiates file transfers using file transfer service
Acts on application’s behalf – a file handed to the service will be
guaranteed to arrive
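A minimal sketch of the data movement layer described above: requests are queued with a priority, handed to a file transfer function, and requeued when a transfer fails. The class name, method names, and the use of `OSError` as the failure signal are all assumptions made for illustration. An in-memory queue like this cannot by itself guarantee arrival; a real service would need persistent state.

```python
import heapq
import itertools


class DataMovementService:
    """Toy priority queue in front of a file transfer function (illustrative)."""

    def __init__(self, transfer, max_attempts=3):
        self._transfer = transfer          # callable(source, dest), raises OSError
        self._max_attempts = max_attempts
        self._queue = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per priority

    def submit(self, source, dest, priority=10):
        """Queue a request; lower numbers are served first."""
        heapq.heappush(self._queue, (priority, next(self._counter), source, dest, 0))

    def run(self):
        """Drain the queue; return (completed destinations, failed destinations)."""
        done, failed = [], []
        while self._queue:
            priority, _, source, dest, tries = heapq.heappop(self._queue)
            try:
                self._transfer(source, dest)
                done.append(dest)
            except OSError:
                if tries + 1 < self._max_attempts:
                    heapq.heappush(
                        self._queue,
                        (priority, next(self._counter), source, dest, tries + 1),
                    )
                else:
                    failed.append(dest)
        return done, failed


calls = []


def flaky_transfer(source, dest):
    """Fake transfer: the raw-data file fails on its first attempt only."""
    calls.append(dest)
    if dest == "tier1:/raw/run1" and calls.count(dest) == 1:
        raise OSError("simulated network glitch")


service = DataMovementService(flaky_transfer)
service.submit("tier0:/user/ntuple", "tier1:/user/ntuple", priority=5)
service.submit("tier0:/raw/run1", "tier1:/raw/run1", priority=0)
done, failed = service.run()
print(done, failed)
```

Although the raw-data request is submitted second, its higher priority means it is attempted, retried, and completed before the user file is moved.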
Replica Location Service:
Makes use of data movement
Should be distributed:
• Distributed/replicated databases (Oracle) with export/import to
XML/other db’s?
• RLI model?
Job probes – example
Continuous flood of jobs
Fill all resources
Use as probes – test if they can use the resources
• Data access, cpu, etc
Understand limitations, bottlenecks of the system
• Baseline measurement, find limits, build and improve
This might be a function of the GOC
Overseen by RAL-Taipei-+ collaboration ?
A challenge might run for a week
Outside of experiment data challenges
In parallel (or part of) data management or other challenges
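The job probe idea can be sketched as a loop that runs a set of small checks against every site and records which ones fail. Everything named here is hypothetical: the site names, the probe functions, and the result format. In practice each probe would be a real grid job rather than a local function call.

```python
import time


def run_probes(sites, probes, clock=time.perf_counter):
    """Run every probe against every site and collect one record per pair."""
    results = []
    for site in sites:
        for name, probe in probes.items():
            start = clock()
            try:
                ok = bool(probe(site))
                error = None
            except Exception as err:  # a crashing probe counts as a failure
                ok, error = False, str(err)
            results.append({
                "site": site,
                "probe": name,
                "ok": ok,
                "seconds": clock() - start,
                "error": error,
            })
    return results


def failing_sites(results):
    """Sites where at least one probe failed, sorted by name."""
    return sorted({r["site"] for r in results if not r["ok"]})


# Hypothetical probes: CPU works everywhere, data access is broken at site-b.
probes = {
    "cpu": lambda site: True,
    "data_access": lambda site: site != "site-b",
}
results = run_probes(["site-a", "site-b"], probes)
print(failing_sites(results))  # ['site-b']
```

Repeating such a run continuously and keeping the timing records would give the baseline measurement mentioned above, against which limits and bottlenecks can be compared.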
Transition to EGEE
Clarify:
LCG project
LCG applications
LCG middleware release
LCG infrastructure
middleware and infrastructure
EGEE (i.e. LCG) infrastructure
The LCG-2 infrastructure IS the EGEE infrastructure
Can be used now by other applications
Expect to run LCG-2 based infrastructure for 1 year
New middleware has to be better than this before it becomes the production service
EGEE-developed middleware runs on pre-production
Moves to production when more functional/stable/reliable/…
Some remarks
Existing LCG-2 sites already support many VOs
Not only LCG
Front-line support for all VOs is via the ROCs
Process to introduce a new VO
Well defined
Some tools needed to make the mechanics simpler
Evaluation of new middleware by applications, and
preparation for deployment in EGEE-1
This is what the pre-production service is for
Resource allocation/negotiation
OMC/ROC managers/NA4 – negotiate with RC’s and apps
Joining EGEE – Overview of process
Application nominates VO manager
Find a CIC to operate the VO server
VO is added to registration procedure
Determine access policy:
Proposed discussion body: NA4 + ROC manager group
• Which sites will agree to run the app (funding, political constraints)
• Need for a test VO?
Modify site configs to allow the VO access
Negotiate with CICs to run VO-specific services:
VO server (see above)
RLS service if required
Resource Brokers (some can be general at a CIC and others owned by apps), UIs – general at CIC/ROC – or on apps' machines, etc.
Potentially (if needed) a BDII to define the app's view of resources
Application software installation
Understand application environment, and how installed at sites
Many of these issues can be negotiated by NA4/SA1 in a short
discussion with the new apps community
Resource Negotiation Policy
The EGEE infrastructure is intended to support and provide resources
to many virtual organisations
Initially HEP (4 LHC experiments) + Biomedical
Each RC supports many VOs and several application domains – situation
now for centres in LCG
Initially must balance resources contributed by the application domains
and those that they consume
Resource centres may have specific allocation policies
• E.g. due to funding agency attribution by science or by project
Expect a level of peer review within application domains to inform the
allocation process
New VOs and Resource centres should satisfy minimum requirements
Commit to bring a level of additional resources consistent with their
requirements
Requirement on JRA1 to provide mechanisms to implement/enforce
quotas, etc
Selection of new VO/RC via NA4
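The balancing rule above (resources contributed by an application domain versus those it consumes) amounts to simple bookkeeping. The sketch below shows one way to express it. The figures are invented, the unit (CPU-hours) is an assumption, and the 10% tolerance is an arbitrary illustrative threshold, not a stated policy.

```python
def balance(contributed, consumed):
    """Contribution minus consumption per application domain."""
    domains = sorted(set(contributed) | set(consumed))
    return {d: contributed.get(d, 0) - consumed.get(d, 0) for d in domains}


def over_consumers(contributed, consumed, tolerance=0.1):
    """Domains consuming more than their contribution plus a tolerance margin."""
    flagged = []
    for domain in sorted(set(contributed) | set(consumed)):
        allowed = contributed.get(domain, 0) * (1 + tolerance)
        if consumed.get(domain, 0) > allowed:
            flagged.append(domain)
    return flagged


# Invented figures in CPU-hours, for illustration only.
contributed = {"HEP": 1000, "Biomedical": 100}
consumed = {"HEP": 900, "Biomedical": 150}

print(balance(contributed, consumed))
print(over_consumers(contributed, consumed))
```

Enforcing such limits would rely on the quota mechanisms requested from JRA1; the sketch only reports the imbalance.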
New Resource Centres
The procedure for new sites to join LCG-2/EGEE is well defined and documented
Sites can join now
Coordination for this is via the ROCs, who will support the installation, set-up, and operation
Certification, Testing and Release Cycle
[Flow diagram; only the labels survive extraction.
Phase labels: DEVELOPMENT & INTEGRATION; UNIT & FUNCTIONAL TESTING; CERTIFICATION TESTING; APP INTEGR; DEPLOYMENT PREPARATION; PRE-PRODUCTION; PRODUCTION; SERVICES.
Activity labels: JRA1; SA1.
Step labels: Dev Tag; Integrate; Basic Functionality Tests; Run Certification Matrix; Run tests (C&T suites, Site suites); Release candidate tag; APPS SW Installation; Certified release tag; DEPLOY; Deployment release tag; Production tag.
Application groups: HEP EXPTS; BIO-MED; OTHER TBD.]
Interoperability
Several grid infrastructures for LHC experiments:
LCG-2/EGEE, Grid2003/OSG, NorduGrid, other national grids
LCG/EGEE have explicit goals to interoperate
One of the LCG service challenges
Joint projects on storage elements, file catalogs, VO management, etc.
Most are VDT-based (or at least Globus-based)
Grid3 & LCG use GLUE schema
Issues are:
File catalogs, information schema, etc at technical level
Policy and semantic issues
Deployment – GridPP support
GridPP contributions to deployment have been crucial:
5 of CERN deployment team funded by PPARC
Essential to bringing the current release to such stability and
reliability – and it’s not that hard to install – 58 sites so far
Grid Operations Centre at RAL
Security team – very active
Summary
Huge amount of work done in the last year to produce a robust set of
middleware
These lessons must be applied to new developments
LCG is being successfully used in experiment data challenges
Many problems found and addressed (tools, bugs, etc)
Other fundamental problems subject of development
Services are now very reliable
Plans for service challenges to help move forward
Must ensure that there are only single developments – coordinate across EGEE/LCG/OSG/etc.
Push for interoperability at all levels – experiments have big role to play in
insisting on single solutions
Emphasis now on strengthening the operational infrastructure
EGEE investment helps here
PPARC/GridPP support has been essential