LCG Status and Plans
GridPP13
Durham, UK
4th July 2005
Ian Bird
IT/GD, CERN
GridPP 13; Durham, 4th July, 2005
Overview
Introduction
Project goals and overview
Status
Applications area
Fabric
Deployment and Operations
Baseline Services
Service Challenges
Summary
LCG – Goals
The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments
Two phases:
Phase 1: 2002 – 2005
Build a service prototype, based on existing grid middleware
Gain experience in running a production grid service
Produce the TDR for the final system
LCG and Experiment Computing TDRs completed and presented to the LHCC last week
(We are here)
Phase 2: 2006 – 2008
Build and commission the initial LHC computing environment
Project Areas & Management
Project Leader
Les Robertson
Resource Manager – Chris Eck
Planning Officer – Jürgen Knobloch
Administration – Fabienne Baud-Lavigne
Applications Area
Pere Mato
Development environment
Joint projects, Data management
Distributed analysis
Middleware Area
Frédéric Hemmer
Provision of a base set of grid middleware (acquisition, development, integration)
Testing, maintenance, support
CERN Fabric Area
Bernd Panzer
Large cluster management
Data recording, Cluster technology
Networking, Computing service at CERN
Distributed Analysis - ARDA
Massimo Lamanna
Prototyping of distributed end-user analysis using grid technology
Grid Deployment Area
Ian Bird
Establishing and managing the Grid Service - Middleware, certification, security operations, registration, authorisation, accounting
Joint with EGEE
Applications Area
Application Area Focus
Deliver the common physics applications software
Organized to ensure focus on real experiment needs
Experiment-driven requirements and monitoring
Architects in management and execution
Open information flow and decision making
Participation of experiment developers
Frequent releases enabling iterative feedback
Success defined by experiment validation
Integration, evaluation, successful deployment
Validation Highlights
POOL successfully used in large scale production in ATLAS, CMS,
LHCb data challenges in 2004
Geant4 firmly established as baseline simulation in successful
ATLAS, CMS, LHCb production
EM & hadronic physics validated
Highly stable: 1 G4-related crash per O(1M) events
SEAL components underpin POOL’s success, in particular the
dictionary system
~400TB of POOL data produced
Objective of a quickly-developed persistency hybrid leveraging ROOT
I/O and RDBMSes has been fulfilled
Now entering a second generation with Reflex
SPI’s Savannah project portal and external software service are
accepted standards inside and outside the project
Current AA Projects
SPI – Software process infrastructure (A. Aimar)
Software and development services: external libraries,
savannah, software distribution, support for build, test, QA,
etc.
ROOT – Core Libraries and Services (R. Brun)
Foundation class libraries, math libraries, framework services,
dictionaries, scripting, GUI, graphics, etc.
POOL – Persistency Framework (D. Duellmann)
Storage manager, file catalogs, event collections, relational
access layer, conditions database, etc.
SIMU - Simulation project (G. Cosmo)
Simulation framework, physics validation studies, MC event
generators, participation in Geant4, Fluka.
SEAL and ROOT Merge
A major change in the AA has been the merger of the SEAL project with the ROOT project
Details of the merge are being discussed following a process
defined by the AF
Breakdown into a number of topics
Proposals discussed with the experiments
Public presentations
Final decisions by the AF
Current status
Dictionary plans approved
MathCore and Vector libraries proposals have been approved
First development release of ROOT including these new
libraries
Ongoing work
SPI
Porting LCG-AA software to amd64 (gcc 3.4.4)
Finalizing software distribution based on Pacman
QA tools: test coverage and savannah reports
ROOT
Development version v5.02 released last week
Including new libraries: mathcore, reflex, cintex, roofit
POOL
Version 2.1 released including new file catalog implementations:
LFCCatalog (lfc), GliteCatalog (glite, Fireman), GTCatalog (globus toolkit)
New version of Conditions DB (COOL) 1.2
Adapting POOL to new dictionaries (Reflex)
SIMU
New Geant4 public minor release 7.1 is being prepared
Public release of Fluka expected by end July
Intense activity in the combined calorimeter physics validation with
ATLAS, report in September.
New MC generators (CASCADE, CHARYBDIS, etc.) being added to the already long list of generators provided
Prototyping persistency of Geant4 geometry with ROOT
Fabric Area
CERN Fabric
Fabric automation has seen very good progress
The new systems for managing large farms are in production at
CERN
Extremely Large Fabric management system (ELFms)
quattor: configuration, installation and management of nodes
lemon: LHC Era Monitoring, system & service monitoring
LEAF: LHC Era Automated Fabric, hardware / state management
Includes technology developed by European DataGrid
New CASTOR Mass Storage System
Was deployed first on the high throughput cluster for the recent
ALICE data recording computing challenge
Agreement on collaboration with Fermilab on Linux distribution
Scientific Linux based on Red Hat Enterprise 3
Improves uniformity between the HEP sites serving LHC and Run
2 experiments
CERN computer centre preparations
Power upgrade to 2.5 MW
Computer centre refurbishment well under way
Acquisition process started
Preparing for 7,000 boxes in 2008
High Throughput Prototype openlab/LCG
Experience with likely ingredients in LCG:
64-bit programming
next generation I/O (10 Gb Ethernet, Infiniband, etc.)
High performance cluster used for evaluations, and for data challenges with experiments
Flexible configuration: components moved in and out of production environment
Co-funded by industry and CERN
ALICE Data Recording Challenge
Target: one week sustained at 450 MB/sec
Used the new version of the CASTOR mass storage system
Note smooth degradation and recovery after equipment failure
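To put the target in perspective, the total volume implied by one week at 450 MB/sec can be worked out directly. The sketch below assumes decimal units (1 TB = 10^6 MB) and a perfectly constant rate; the function name is illustrative only and does not come from the talk.

```python
# Data volume implied by a sustained recording rate.
# Assumption: decimal units, 1 TB = 1,000,000 MB, and a constant rate.
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_volume_tb(rate_mb_per_s: float, days: float) -> float:
    """Return the volume in TB written at a constant rate over `days` days."""
    total_mb = rate_mb_per_s * days * SECONDS_PER_DAY
    return total_mb / 1_000_000


if __name__ == "__main__":
    # Target from the slide: 450 MB/sec sustained for one week.
    print(f"{sustained_volume_tb(450, 7):.2f} TB")
```

Under these assumptions the one-week target corresponds to roughly 272 TB recorded.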
Deployment and Operations
Computing Resources: June 2005
Number of sites is already at the scale expected for LHC
- demonstrates the full complexity of operations
[Map: countries providing resources and countries anticipating joining]
In LCG-2:
139 sites, 32 countries
~14,000 cpu
~5 PB storage
Includes non-EGEE sites:
• 9 countries
• 18 sites
Operations Structure
Operations Management Centre (OMC):
At CERN – coordination etc
Core Infrastructure Centres (CIC)
Manage daily grid operations – oversight, troubleshooting
Run essential infrastructure services
Provide 2nd level support to ROCs
UK/I, Fr, It, CERN, + Russia (M12)
Hope to get non-European centres
Regional Operations Centres (ROC)
Act as front-line support for user and operations issues
Provide local knowledge and adaptations
One in each region – many distributed
User Support Centre (GGUS)
In FZK – support portal – provide single point of contact (service desk)
Grid Operations
[Diagram: OMC, CICs, ROCs and RCs; RC = Resource Centre]
The grid is flat, but
Hierarchy of responsibility
Essential to scale the operation
CICs act as a single Operations Centre
Operational oversight (grid operator) responsibility rotates weekly between CICs
Report problems to ROC/RC
ROC is responsible for ensuring problem is resolved
ROC oversees regional RCs
ROCs responsible for organising the operations in a region
Coordinate deployment of middleware, etc
CERN coordinates sites not associated with a ROC
It is in setting up this operational infrastructure where we
have really benefited from EGEE funding
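The weekly hand-over of grid-operator responsibility between CICs amounts to a simple round-robin schedule. The sketch below is illustrative only: the CIC names are taken from the Operations Structure slide, but the rotation order, the inclusion of Russia (which joins at M12) and the function name are assumptions, not the actual duty roster.

```python
# Hypothetical rotation order; the slides name the CICs but not the duty order.
CICS = ["UK/I", "Fr", "It", "CERN", "Russia"]


def cic_on_duty(week_index: int, cics: list[str] = CICS) -> str:
    """Return the CIC holding grid-operator responsibility in a given week."""
    return cics[week_index % len(cics)]


if __name__ == "__main__":
    for week in range(7):
        print(f"week {week}: {cic_on_duty(week)}")
```

With five CICs in the list, each centre would be on duty one week in five.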
Grid monitoring
Operation of Production Service: real-time display of grid operations
Accounting information
Selection of Monitoring tools:
GIIS Monitor + Monitor Graphs
Sites Functional Tests
GOC Data Base
Scheduled Downtimes
Live Job Monitor
GridIce – VO + fabric view
Certificate Lifetime Monitor
Operations focus
Main focus of activities now:
Improving the operational reliability and application efficiency:
Automating monitoring alarms
Ensuring a 24x7 service
Removing sites that fail functional tests
Operations interoperability with OSG and others
Improving user support:
Demonstrate to users a reliable and trusted support infrastructure
Deployment of gLite components:
Testing, certification, pre-production service
Migration planning and deployment – while maintaining/growing interoperability
Further developments now have to be driven by experience in real use
[Timeline graphic: LCG-2 (=EGEE-0) and LCG-3 (=EGEE-x?), each moving from prototyping to product, 2004 to 2005]
Recent ATLAS work
[Chart: number of jobs/day (total jobs), y-axis 0 to 9000, x-axis weekly from 26 Jan 2005 to 6 Apr 2005]
~10,000 concurrent jobs in the system
• ATLAS jobs in EGEE/LCG-2 in 2005
• In latest period up to 8K jobs/day
• Several times the current capacity for ATLAS at CERN alone – shows the reality of the grid solution
Baseline Services &
Service Challenges
Baseline Services: Goals
Experiments and regional centres agree on baseline services
Support the computing models for the initial period of LHC
Thus must be in operation by September 2006.
Expose experiment plans and ideas
Timescales
For TDR – now
For SC3 – testing, verification, not all components
For SC4 – must have complete set
Define services with targets for functionality &
scalability/performance metrics.
Very much driven by the experiments’ needs –
But try to understand site and other constraints
Baseline services
Storage management services
Based on SRM as the interface
Basic transfer services
gridFTP, srmCopy
VO management services
Clear need for VOMS: roles, groups, subgroups
POSIX-like I/O service
local files, and include links to catalogues
Reliable file transfer service
Grid catalogue services
Catalogue and data management tools
Database services
Compute Resource Services
Workload management
Required at Tier1,2
Grid monitoring tools and services
Focussed on job monitoring
VO agent framework
Applications software installation service
Reliable messaging service
Information system
Nothing really surprising here – but a lot was clarified in terms of requirements, implementations, deployment, security, etc
Preliminary: Priorities
A: High priority, mandatory service
B: Standard solutions required, experiments could select different implementations
C: Common solutions desirable, but not essential

Service                               ALICE  ATLAS  CMS   LHCb
Storage Element                       A      A      A     A
Basic transfer tools                  A      A      A     A
Reliable file transfer service        A      A      A/B   A
Catalogue services                    B      B      B     B
Catalogue and data management tools   C      C      C     C
Compute Element                       A      A      A     A
Workload Management                   B      A      A     C
VO agents                             A      A      A     A
VOMS                                  A      A      A     A
Database services                     A      A      A     A
Posix-I/O                             C      C      C     C
Application software installation     C      C      C     C
Job monitoring tools                  C      C      C     C
Reliable messaging service            C      C      C     C
Information system                    A      A      A     A
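The priority table lends itself to a small lookup structure, for example to list the services that all four experiments rate at the same level. The sketch below copies the ratings from the slide; the variable and function names are illustrative assumptions, and "A/B" is treated as a distinct value rather than as a match for "A".

```python
# Priority ratings per service, in the order (ALICE, ATLAS, CMS, LHCb).
EXPERIMENTS = ("ALICE", "ATLAS", "CMS", "LHCb")
PRIORITIES = {
    "Storage Element": ("A", "A", "A", "A"),
    "Basic transfer tools": ("A", "A", "A", "A"),
    "Reliable file transfer service": ("A", "A", "A/B", "A"),
    "Catalogue services": ("B", "B", "B", "B"),
    "Catalogue and data management tools": ("C", "C", "C", "C"),
    "Compute Element": ("A", "A", "A", "A"),
    "Workload Management": ("B", "A", "A", "C"),
    "VO agents": ("A", "A", "A", "A"),
    "VOMS": ("A", "A", "A", "A"),
    "Database services": ("A", "A", "A", "A"),
    "Posix-I/O": ("C", "C", "C", "C"),
    "Application software installation": ("C", "C", "C", "C"),
    "Job monitoring tools": ("C", "C", "C", "C"),
    "Reliable messaging service": ("C", "C", "C", "C"),
    "Information system": ("A", "A", "A", "A"),
}


def services_with_uniform_priority(level: str, table=PRIORITIES) -> list[str]:
    """Return services that every experiment rates at exactly `level`."""
    return [name for name, ratings in table.items()
            if all(rating == level for rating in ratings)]


if __name__ == "__main__":
    for service in services_with_uniform_priority("A"):
        print(service)
```

Only Workload Management and the reliable file transfer service show any disagreement between experiments; every other row is uniform.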
Service Challenges – ramp up to LHC start-up service
June05 - Technical Design Report
Sep05 - SC3 Service Phase
May06 – SC4 Service Phase
Sep06 – Initial LHC Service in stable operation
Apr07 – LHC Service commissioned
[Timeline 2005 to 2008: SC2, SC3, SC4, LHC Service Operation; cosmics, first beams, first physics, full physics run]
SC2 – Reliable data transfer (disk-network-disk) – 5 Tier-1s, aggregate 500 MB/sec sustained at CERN
SC3 – Reliable base service – most Tier-1s, some Tier-2s – basic experiment software chain – grid data throughput 500 MB/sec, including mass storage (~25% of the nominal final throughput for the proton period)
SC4 – All Tier-1s, major Tier-2s – capable of supporting full experiment software chain inc. analysis – sustain nominal final grid data throughput
LHC Service in Operation – September 2006 – ramp up to full operational capacity by April 2007 – capable of handling twice the nominal data throughput
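The throughput targets can be related to each other with a little arithmetic. The sketch below takes the SC3 figure (500 MB/sec, stated as roughly 25% of nominal) and derives the implied nominal rate and the "twice nominal" capacity. Because the slide gives the fraction only approximately, the results are order-of-magnitude estimates; the names used are illustrative assumptions.

```python
# Figures from the slide; the 25% fraction is approximate ("~25%").
SC3_TARGET_MB_S = 500
SC3_FRACTION_OF_NOMINAL = 0.25


def implied_nominal_mb_s(target_mb_s: float, fraction: float) -> float:
    """Nominal throughput implied by a target that is `fraction` of nominal."""
    return target_mb_s / fraction


if __name__ == "__main__":
    nominal = implied_nominal_mb_s(SC3_TARGET_MB_S, SC3_FRACTION_OF_NOMINAL)
    print(f"SC4 (nominal) target:        ~{nominal:.0f} MB/sec")
    print(f"April 2007 (twice nominal):  ~{2 * nominal:.0f} MB/sec")
```

On these assumptions SC4 would need to sustain about 2 GB/sec, and the commissioned service about 4 GB/sec.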
Baseline Services, Service Challenges, Production Service, Pre-production service, gLite deployment, …
… confused?
Services …
Baseline services
Are the set of essential services that the experiments need to be in production by September 2006
Verify components in SC3, SC4
Service challenges
The ramp up of the LHC computing environment – building up the production service, based on results and lessons of the service challenges
Production service
The evolving service putting in place new components prototyped in SC3, SC4
No big-bang changes, but many releases!!!
gLite deployment
As new components are certified, will be added to the production service releases, either in parallel with or replacing existing services
Pre-production service
Should be literally a preview of the production service,
But is a demonstration of gLite services at the moment – this has been forced on us by many other constraints (urgency to “deploy” gLite, need for reasonable scale testing, … )
Releases and Distributions
We intend to maintain a single line of production middleware
distributions
Middleware releases from [JRA1, VDT, LCG, …]
Middleware distributions for deployment from GDA/SA1
Remember: announcement of a release is months away from a deployable
distribution (based on last 2 years experience)
Distributions still labelled “LCG-2.x.x”
Would like to change to something less specific to avoid LCG/EGEE confusion
Throughout all this, we must maintain a reliable production service, which gradually improves in reliability and performance
Frequent updates for Service challenge sites
But only needed for SC sites
Frequent updates as gLite is deployed
Not clear if all sites will deploy all gLite components immediately
This is unavoidable
A strong request from LHC experiment spokesmen to the LCG POB:
“early, gradual and frequent releases of the [baseline] services is essential
rather than waiting for a complete sets”
Summary
We are at the end of LCG Phase 1
Good time to step back and look at achievements and
issues
LCG Phase 2 has really started
Consolidation of AA projects
Baseline services
Service challenges and experiment data challenges
Acquisitions process starting
No new developments: make what we have work absolutely reliably, and be scalable and performant
Timescale is extremely tight
Must ensure that we have appropriate levels of effort
committed