LCG Status and Plans
GridPP13
Durham, UK
4th July 2005
Ian Bird
IT/GD, CERN
Overview

- Introduction
  - Project goals and overview
- Status
  - Applications area
  - Fabric
  - Deployment and Operations
- Baseline Services
- Service Challenges
- Summary

LCG – Goals

The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments.

Two phases (we are currently at the transition between them):

- Phase 1: 2002 – 2005
  - Build a service prototype, based on existing grid middleware
  - Gain experience in running a production grid service
  - Produce the TDR for the final system
  - The LCG and experiment computing TDRs were completed and presented to the LHCC last week
- Phase 2: 2006 – 2008
  - Build and commission the initial LHC computing environment
Project Areas & Management

- Project Leader: Les Robertson
  - Resource Manager – Chris Eck
  - Planning Officer – Jürgen Knobloch
  - Administration – Fabienne Baud-Lavigne
- Applications Area (Pere Mato): development environment; joint projects, data management; distributed analysis
- Distributed Analysis – ARDA (Massimo Lamanna): prototyping of distributed end-user analysis using grid technology
- Middleware Area (Frédéric Hemmer): provision of a base set of grid middleware (acquisition, development, integration); testing, maintenance, support
- CERN Fabric Area (Bernd Panzer): large cluster management; data recording, cluster technology; networking, computing service at CERN
- Grid Deployment Area (Ian Bird): establishing and managing the Grid Service – middleware, certification, security, operations, registration, authorisation, accounting; joint with EGEE
Applications Area
Application Area Focus

- Deliver the common physics applications software
- Organized to ensure focus on real experiment needs:
  - Experiment-driven requirements and monitoring
  - Architects in management and execution
  - Open information flow and decision making
  - Participation of experiment developers
  - Frequent releases enabling iterative feedback
  - Success defined by experiment validation: integration, evaluation, successful deployment
Validation Highlights

- POOL successfully used in large-scale production in the ATLAS, CMS and LHCb data challenges in 2004
  - ~400 TB of POOL data produced
  - The objective of a quickly-developed persistency hybrid leveraging ROOT I/O and RDBMSes has been fulfilled
- Geant4 firmly established as the baseline simulation in successful ATLAS, CMS and LHCb production
  - EM & hadronic physics validated
  - Highly stable: 1 G4-related crash per O(1M) events
- SEAL components underpin POOL's success, in particular the dictionary system
  - Now entering a second generation with Reflex
- SPI's Savannah project portal and external software service are accepted standards inside and outside the project
Current AA Projects

- SPI – Software process infrastructure (A. Aimar)
  - Software and development services: external libraries, Savannah, software distribution, support for build, test, QA, etc.
- ROOT – Core Libraries and Services (R. Brun)
  - Foundation class libraries, math libraries, framework services, dictionaries, scripting, GUI, graphics, etc.
- POOL – Persistency Framework (D. Duellmann)
  - Storage manager, file catalogs, event collections, relational access layer, conditions database, etc.
- SIMU – Simulation project (G. Cosmo)
  - Simulation framework, physics validation studies, MC event generators, participation in Geant4, Fluka.
SEAL and ROOT Merge

- The major change in the AA has been the merge of the SEAL project with the ROOT project
- Details of the merge are being discussed, following a process defined by the AF:
  - Breakdown into a number of topics
  - Proposals discussed with the experiments
  - Public presentations
  - Final decisions by the AF
- Current status:
  - Dictionary plans approved
  - MathCore and Vector libraries proposals have been approved
  - First development release of ROOT including these new libraries
Ongoing work

- SPI
  - Porting LCG-AA software to amd64 (gcc 3.4.4)
  - Finalizing software distribution based on Pacman
  - QA tools: test coverage and Savannah reports
- ROOT
  - Development version v5.02 released last week
  - Includes new libraries: mathcore, reflex, cintex, roofit (a short scripting sketch follows this list)
- POOL
  - Version 2.1 released, including new file catalog implementations: LFCCatalog (LFC), GliteCatalog (gLite Fireman), GTCatalog (Globus Toolkit)
  - New version of the Conditions DB (COOL) 1.2
  - Adapting POOL to the new dictionaries (Reflex)
- SIMU
  - New Geant4 public minor release 7.1 is being prepared
  - Public release of Fluka expected by end of July
  - Intense activity in the combined calorimeter physics validation with ATLAS; report in September
  - New MC generators being added (CASCADE, CHARYBDIS, etc.) to the already long list of generators provided
  - Prototyping persistency of Geant4 geometry with ROOT
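As a side illustration of the ROOT scripting access mentioned in the ROOT item above, here is a minimal sketch that fills and inspects a histogram. It assumes a local ROOT installation with the PyROOT Python bindings available; it is a generic illustration, not an LCG-AA-specific interface:

    # Minimal PyROOT sketch: fill and inspect a histogram via the ROOT core
    # and math libraries. Assumes ROOT with PyROOT bindings on PYTHONPATH;
    # generic illustration only, not an LCG-AA-specific API.
    import ROOT

    rng = ROOT.TRandom3(42)                                # random number generator
    hist = ROOT.TH1F("h_gaus", "Gaussian sample", 100, -5.0, 5.0)

    for _ in range(10000):
        hist.Fill(rng.Gaus(0.0, 1.0))                      # fill with N(0,1) samples

    print("entries:", int(hist.GetEntries()))
    print("mean: %.3f  rms: %.3f" % (hist.GetMean(), hist.GetRMS()))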
Fabric Area
CERN Fabric

- Fabric automation has seen very good progress
  - The new systems for managing large farms are in production at CERN, making up the Extremely Large Fabric management system:
    - configuration, installation and management of nodes
    - lemon (LHC Era Monitoring): system & service monitoring
    - LHC Era Automated Fabric: hardware / state management
  - Includes technology developed by the European DataGrid
CERN Fabric

- Fabric automation has seen very good progress
  - The new systems for managing large farms are in production at CERN
- New CASTOR Mass Storage System
  - Deployed first on the high-throughput cluster for the recent ALICE data recording computing challenge
- Agreement on collaboration with Fermilab on the Linux distribution
  - Scientific Linux, based on Red Hat Enterprise 3
  - Improves uniformity between the HEP sites serving LHC and Run 2 experiments
- CERN computer centre preparations
  - Power upgrade to 2.5 MW
  - Computer centre refurbishment well under way
  - Acquisition process started
Preparing for 7,000 boxes in 2008
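A back-of-the-envelope cross-check (my arithmetic, not a figure from the slides): spreading the 2.5 MW power upgrade mentioned above across roughly 7,000 boxes gives an average electrical budget of about 350 W per box, ignoring cooling and other infrastructure overheads:

    # Rough per-box power budget combining the 2.5 MW upgrade with the
    # ~7,000 boxes target; a sanity check only, overheads ignored.
    total_power_w = 2.5e6     # 2.5 MW
    boxes = 7000

    print("average budget per box: %.0f W" % (total_power_w / boxes))   # ~357 W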
High Throughput Prototype (openlab/LCG)

- Experience with likely ingredients in LCG:
  - 64-bit programming
  - next-generation I/O (10 Gb Ethernet, Infiniband, etc.)
- High-performance cluster used for evaluations, and for data challenges with experiments
- Flexible configuration: components moved in and out of the production environment
- Co-funded by industry and CERN
ALICE Data Recording Challenge

- Target: one week sustained at 450 MB/sec
- Used the new version of the CASTOR mass storage system
- Note the smooth degradation and recovery after equipment failure
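For scale, a quick calculation (not a number from the slides) of the volume implied by the stated target of one week sustained at 450 MB/sec:

    # Total data volume implied by one week sustained at 450 MB/s.
    rate_mb_per_s = 450
    seconds_per_week = 7 * 24 * 3600                     # 604,800 s

    total_tb = rate_mb_per_s * seconds_per_week / 1e6    # decimal units: 1 TB = 1e6 MB
    print("%.0f TB written in one week" % total_tb)      # ~272 TB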
Deployment and Operations
Computing Resources: June 2005

[World map: countries providing resources and countries anticipating joining]

- The number of sites is already at the scale expected for LHC – this demonstrates the full complexity of operations
- In LCG-2:
  - 139 sites, 32 countries
  - ~14,000 CPUs
  - ~5 PB storage
- Includes non-EGEE sites: 18 sites in 9 countries
Operations Structure

- Operations Management Centre (OMC):
  - At CERN – coordination etc.
- Core Infrastructure Centres (CIC):
  - Manage daily grid operations – oversight, troubleshooting
  - Run essential infrastructure services
  - Provide 2nd-level support to ROCs
  - UK/I, Fr, It, CERN, + Russia (M12)
  - Hope to get non-European centres
- Regional Operations Centres (ROC):
  - Act as front-line support for user and operations issues
  - Provide local knowledge and adaptations
  - One in each region – many distributed
- User Support Centre (GGUS):
  - In FZK – support portal – provides a single point of contact (service desk)
Grid Operations

[Diagram: the OMC and the CICs at the centre, with ROCs and their regional Resource Centres (RCs) around them]

- The grid is flat, but there is a hierarchy of responsibility
  - Essential to scale the operation
- CICs act as a single Operations Centre
  - Operational oversight (grid operator) responsibility rotates weekly between CICs
  - Problems are reported to the ROC/RC
  - The ROC is responsible for ensuring the problem is resolved
  - The ROC oversees the regional RCs
- ROCs are responsible for organising the operations in a region
  - Coordinate deployment of middleware, etc.
- CERN coordinates sites not associated with a ROC
- RC = Resource Centre

It is in setting up this operational infrastructure that we have really benefited from EGEE funding.
Grid monitoring

- Operation of the Production Service: real-time display of grid operations
- Accounting information
- Selection of monitoring tools:
  - GIIS Monitor + Monitor Graphs
  - Sites Functional Tests
  - GOC Data Base
  - Scheduled Downtimes
  - Live Job Monitor
  - GridIce – VO + fabric view
  - Certificate Lifetime Monitor (a minimal lifetime check is sketched below)
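As a sketch of the kind of check a certificate lifetime monitor performs for each site, the snippet below tests whether a host certificate expires within a week using the standard openssl client. The certificate path is a typical but assumed location, and this is not the actual LCG/EGEE monitor implementation:

    # Warn if a host certificate expires within the next 7 days.
    # Path is an assumed, typical location; uses openssl's -checkend option.
    import subprocess

    CERT = "/etc/grid-security/hostcert.pem"   # assumed path, adjust as needed
    WARN_WINDOW = 7 * 24 * 3600                # 7 days, in seconds

    result = subprocess.run(
        ["openssl", "x509", "-checkend", str(WARN_WINDOW), "-noout", "-in", CERT],
        capture_output=True,
    )
    if result.returncode == 0:
        print("OK: certificate valid for at least another 7 days")
    else:
        print("WARNING: certificate expires within 7 days (or could not be read)")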
Operations focus

- Main focus of activities now:
  - Improving the operational reliability and application efficiency:
    - Automating monitoring → alarms
    - Ensuring a 24x7 service
    - Removing sites that fail functional tests
    - Operations interoperability with OSG and others
  - Improving user support:
    - Demonstrate to users a reliable and trusted support infrastructure
  - Deployment of gLite components:
    - Testing, certification → pre-production service
    - Migration planning and deployment – while maintaining/growing interoperability
  - Further developments now have to be driven by experience in real use

[Timeline: LCG-2 (=EGEE-0), 2004–2005, prototyping → product; LCG-3 (=EGEE-x?), prototyping from 2005 → product]
Recent ATLAS work

[Chart: ATLAS number of jobs/day and total jobs, January–April 2005; annotation: ~10,000 concurrent jobs in the system]

- ATLAS jobs in EGEE/LCG-2 in 2005
- In the latest period, up to 8K jobs/day
- Several times the current capacity for ATLAS at CERN alone – shows the reality of the grid solution
Baseline Services & Service Challenges
Baseline Services: Goals

- Experiments and regional centres agree on baseline services
  - Support the computing models for the initial period of LHC
  - Thus must be in operation by September 2006
- Expose experiment plans and ideas
- Timescales:
  - For TDR – now
  - For SC3 – testing, verification, not all components
  - For SC4 – must have complete set
- Define services with targets for functionality & scalability/performance metrics
- Very much driven by the experiments' needs
  - But try to understand site and other constraints
Baseline services

- Storage management services
  - Based on SRM as the interface
- Basic transfer services
  - gridFTP, srmCopy (a minimal transfer sketch follows after this list)
- Reliable file transfer service
- Grid catalogue services
- Catalogue and data management tools
- Database services
  - Required at Tier1, Tier2
- Compute Resource Services
- Workload management
- VO management services
  - Clear need for VOMS: roles, groups, subgroups
- POSIX-like I/O service
  - Local files, and including links to catalogues
- Grid monitoring tools and services
  - Focussed on job monitoring
- VO agent framework
- Applications software installation service
- Reliable messaging service
- Information system

Nothing really surprising here – but a lot was clarified in terms of requirements, implementations, deployment, security, etc.
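To make the basic transfer services item concrete, here is a minimal sketch of a single gridFTP copy driven from Python with the standard globus-url-copy client. The storage element hostname and paths are hypothetical, and a valid grid proxy (e.g. from voms-proxy-init) is assumed to exist already:

    # One-file gridFTP transfer using the globus-url-copy client.
    # Hostname/paths are hypothetical; a valid grid proxy is assumed.
    import subprocess

    source = "file:///tmp/events.root"                                       # local file
    dest = "gsiftp://se.example.org/dpm/example.org/home/myvo/events.root"   # hypothetical SE path

    subprocess.run(["globus-url-copy", source, dest], check=True)            # raises on failure
    print("transfer completed:", dest)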
Preliminary: Priorities

A: High priority, mandatory service
B: Standard solutions required, experiments could select different implementations
C: Common solutions desirable, but not essential

Service                               ALICE  ATLAS  CMS  LHCb
Storage Element                       A      A      A    A
Basic transfer tools                  A      A      A    A
Reliable file transfer service        A      A      A/B  A
Catalogue services                    B      B      B    B
Catalogue and data management tools   C      C      C    C
Compute Element                       A      A      A    A
Workload Management                   B      A      A    C
VO agents                             A      A      A    A
VOMS                                  A      A      A    A
Database services                     A      A      A    A
Posix-I/O                             C      C      C    C
Application software installation     C      C      C    C
Job monitoring tools                  C      C      C    C
Reliable messaging service            C      C      C    C
Information system                    A      A      A    A
Service Challenges – ramp up to LHC start-up service

- Jun 05 – Technical Design Report
- Sep 05 – SC3 Service Phase
- May 06 – SC4 Service Phase
- Sep 06 – Initial LHC Service in stable operation
- Apr 07 – LHC Service commissioned

[Timeline 2005–2008: SC2, SC3, SC4, LHC Service Operation; cosmics, first beams, first physics, full physics run]

- SC2 – Reliable data transfer (disk-network-disk) – 5 Tier-1s, aggregate 500 MB/sec sustained at CERN
- SC3 – Reliable base service – most Tier-1s, some Tier-2s – basic experiment software chain – grid data throughput 500 MB/sec, including mass storage (~25% of the nominal final throughput for the proton period)
- SC4 – All Tier-1s, major Tier-2s – capable of supporting the full experiment software chain incl. analysis – sustain nominal final grid data throughput
- LHC Service in Operation – September 2006 – ramp up to full operational capacity by April 2007 – capable of handling twice the nominal data throughput
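The throughput targets above imply (simple arithmetic, not an extra figure from the slides) a nominal final grid data throughput of roughly 2 GB/sec, and therefore roughly 4 GB/sec for the "twice the nominal data throughput" LHC service target:

    # Nominal throughput implied by the SC3 figure (500 MB/s ~ 25% of nominal)
    # and the "twice nominal" LHC service target.
    sc3_rate_mb_s = 500
    fraction_of_nominal = 0.25

    nominal = sc3_rate_mb_s / fraction_of_nominal
    print("nominal final throughput:   %.0f MB/s" % nominal)          # ~2000 MB/s
    print("twice nominal (LHC target): %.0f MB/s" % (2 * nominal))    # ~4000 MB/s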
Baseline Services, Service Challenges, Production Service, Pre-production service, gLite deployment, …
… confused?
Services …

- Baseline services
  - The set of essential services that the experiments need to be in production by September 2006
- Service challenges
  - Verify components in SC3, SC4
- Production service
  - The ramp-up of the LHC computing environment – building up the production service, based on results and lessons of the service challenges
  - The evolving service, putting in place new components prototyped in SC3, SC4
  - No big-bang changes, but many releases!!!
- gLite deployment
  - As new components are certified, they will be added to the production service releases, either in parallel with or replacing existing services
- Pre-production service
  - Should be literally a preview of the production service
  - But is a demonstration of gLite services at the moment – this has been forced on us by many other constraints (urgency to "deploy" gLite, need for reasonable scale testing, …)
Releases and Distributions

- We intend to maintain a single line of production middleware distributions
  - Middleware releases from [JRA1, VDT, LCG, …]
  - Middleware distributions for deployment from GDA/SA1
  - Remember: the announcement of a release is months away from a deployable distribution (based on the last 2 years' experience)
- Throughout this, we must maintain a reliable production service, which gradually improves in reliability and performance
  - Distributions are still all labelled "LCG-2.x.x"; we would like to change to something less LCG/EGEE-specific to avoid confusion
- Frequent updates for Service Challenge sites
  - But only needed for SC sites
- Frequent updates as gLite is deployed
  - Not clear if all sites will deploy all gLite components immediately
- This is unavoidable
- A strong request from the LHC experiment spokesmen to the LCG POB:
  - "early, gradual and frequent releases of the [baseline] services is essential rather than waiting for a complete set"
Summary

- We are at the end of LCG Phase 1
  - A good time to step back and look at achievements and issues
- LCG Phase 2 has really started:
  - Consolidation of AA projects
  - Baseline services
  - Service challenges and experiment data challenges
  - Acquisition process starting
- No new developments → make what we have work absolutely reliably, and be scalable and performant
- The timescale is extremely tight
- We must ensure that we have appropriate levels of effort committed