e-Infrastructure in Europe


Enabling Grids for E-sciencE
The LHC Computing Grid service and Grid User Interface
Olga Kodolova
(The presentation uses material from talks by Les Robertson, Julia Andreeva, Vladimir Korenkov and Hans Hoffmann)
Contents
• LCG project short overview
• LHC computing model and requirements on the LCG project (as estimated in the LCG TDR)
• Middleware evolution, the new generation – gLite
• Grid User Interface
• Summary
LCG project
• LCG project approved by CERN Council in September 2001
• LHC Experiments
– Grid projects: Europe, US
– Regional & national centres
• Goal
– Prepare and deploy the computing environment to help the experiments analyse the data from the LHC detectors
• Phase 1 – 2002-05
– development of common software prototype
– operation of a pilot computing service
• Phase 2 – 2006-08
– acquire, build and operate the LHC computing service
LCG activities
• Applications Area
– Common projects: libraries and tools, data management
– Distributed Analysis: joint project on distributed analysis with the LHC experiments
• Middleware Area
– Provision of grid middleware – acquisition, development, integration, testing, support
• Grid Deployment Area
– Establishing and managing the Grid Service – middleware certification, security, operations, Service Challenges
• CERN Fabric Area
– Cluster management, data handling, cluster technology, networking (WAN + local), computing service at CERN
LHC computing Grid Service
Worldwide LHC Computing Grid
• The goal is to create a distributed computing environment for data storage and analysis for the 4 LHC experiments
– Grid projects: Europe, US
– Regional & national centres
• Phase 1 – 2002-05
– development of common software prototype
– operation of a pilot computing service
• Phase 2 – 2006-08
– acquire, build and operate the LHC computing service
• WLCG Collaboration
– ~130 computing centres
– 12 large centres (Tier-0, Tier-1)
– 40-50 federations of smaller “Tier-2” centres
– 29 countries
The LHC detectors
[Images of the CMS, ATLAS and LHCb detectors]
~3.5 Petabytes/year, ~10^9 events/year
ALICE Collaboration
• Size: ~1/2 of ATLAS or CMS, ~2× LHCb – ~1000 people, 30 countries, ~80 institutes
• Total weight: 10,000 t
• Overall diameter: 16.00 m
• Overall length: 25 m
• Magnetic field: 0.5 T
The CMS Computing Requests
• Event rate to tape: 100 Hz (10^9 events/year)
• Days of running per year: 200
• Data written to tape: ~2 Petabytes/year
• Monte Carlo events: 9·10^8
• CPU (Tier-0 + Tier-2 + Tier-3): ~26,000 kSI2k
• Typical networks: 1-10 Gbps
• Lifetime of experiment: 2-3 decades
• Users: ~1000 physicists distributed all over the world
[Figure: Level-1 trigger rate (Hz) vs. event size (bytes) for LHCb, ATLAS, CMS, HERA-B, KLOE, CDF II, CDF, H1, ZEUS, UA1, LEP, ALICE and NA49 – the LHC experiments sit in the region of high Level-1 rates (up to ~1 MHz), high numbers of channels, high bandwidth (~500 Gbit/s) and Petabyte-scale data archives; an annotation compares the rate to “1 billion people surfing the Web”]
The full processing and analysis cannot be performed with the standard toolkits; it is essential to use distributed computing resources.
Eventflow

              Rate [Hz]   RAW [MB]   ESD/rDST/RECO [MB]   AOD [kB]   Monte Carlo [MB/evt]   Monte Carlo [% of real]
ALICE HI      100         12.5       2.5                  250        300                     100
ALICE pp      100         1          0.04                 4          0.4                     100
ATLAS         200         1.6        0.5                  100        2                       20
CMS           150         1.5        0.25                 50         2                       100
LHCb          2000        0.025      0.025                0.5        –                       20

50 days running in 2007
10^7 seconds/year pp from 2008 on → ~10^9 events/experiment
10^6 seconds/year heavy ion
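A quick back-of-the-envelope check of what these numbers imply for the annual RAW data volume per experiment (a minimal sketch; the rates, event sizes and live times are the ones quoted above):

```python
# Rough annual RAW data volume per experiment: rate [Hz] x RAW size [MB] x live time [s].
# Rates and event sizes are taken from the Eventflow table above; live times from the slide.
LIVE_PP = 1e7      # seconds/year of pp running (from 2008 on)
LIVE_HI = 1e6      # seconds/year of heavy-ion running

experiments = {            # (rate [Hz], RAW size [MB], live time [s])
    "ALICE HI": (100, 12.5, LIVE_HI),
    "ALICE pp": (100, 1.0, LIVE_PP),
    "ATLAS":    (200, 1.6, LIVE_PP),
    "CMS":      (150, 1.5, LIVE_PP),
    "LHCb":     (2000, 0.025, LIVE_PP),
}

for name, (rate, raw_mb, live) in experiments.items():
    petabytes = rate * raw_mb * live / 1e9   # MB -> PB
    print(f"{name:9s} ~{rate * live:.1e} events/year, ~{petabytes:.1f} PB RAW/year")

# e.g. ATLAS: 200 Hz x 1.6 MB x 1e7 s ~= 3.2 PB of RAW data per year, before copies,
# reconstruction output, AOD and Monte Carlo - consistent with the ~15 PB/year total
# quoted later in the talk once all experiments and data tiers are added up.
```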
CPU Requirements
[Figure: stacked bar chart of CPU requirements (MSI2000, axis up to 350) for 2007-2010, broken down by tier (CERN, Tier-1, Tier-2) and by experiment (ALICE, ATLAS, CMS, LHCb); 58% of the requirement is pledged]
Backup slide: Disk Requirements
[Figure: stacked bar chart of disk requirements (PB, axis up to 160) for 2007-2010, broken down by tier (CERN, Tier-1, Tier-2) and by experiment (ALICE, ATLAS, CMS, LHCb); 54% of the requirement is pledged]
Backup slide: Tape Requirements
[Figure: stacked bar chart of tape requirements (PB, axis up to 160) for 2007-2010, broken down by tier (CERN, Tier-1) and by experiment (ALICE, ATLAS, CMS, LHCb); 75% of the requirement is pledged]
Requirements
Need to provide transparent access to the data and the resources for ~5000 scientists in ~500 institutes all over the world:
• Reliable and secure storage (~15 Petabytes of data will be produced annually)
• Fast network (low latency, high bandwidth)
• Managing the sharing of resources between experiments, analysis and production, different analysis groups and individual users (common policies)
• Support and training
LCG Service Hierarchy
Tier-0 – the accelerator centre
• Data acquisition & initial processing
• Long-term data storage
• Distribution of data → Tier-1 centres

Tier-1 – “online” to the data acquisition process → high availability
• Managed Mass Storage – grid-enabled data service
• Data-heavy analysis
• National, regional support
Tier-1 centres: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands Tier-1 (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois); US – Brookhaven (NY)

Tier-2: ~120 centres (40-50 federations) in ~29 countries
• Simulation
• End-user analysis – batch and interactive
LCG depends on 2 major science grid infrastructures
• The LCG service runs & relies on grid infrastructure provided by:
– EGEE – Enabling Grids for E-sciencE
– OSG – US Open Science Grid
• EGEE: > 180 sites, 40 countries, > 24,000 processors, ~5 PB storage
• Steady growth over the lifetime of the project; ramp-up of OSG use over the last 6 months
• More than 35K jobs/day on the EGEE Grid (all VOs); the LHC VOs alone reached ~30K jobs/day in 2006
• The LHC user should see one seamless global computing facility
[Figures: growth in the number of EGEE sites and CPUs from April 2004 to early 2006, and K-jobs/day on the EGEE Grid and OSG by VO (alice, atlas, cms, lhcb, biomed, compchem, dteam, egeode, egrid, esr, fusion, geant4, magic, ops, planck, …) from mid-2005 to mid-2006]
[Table: number of EGEE sites per country – Austria 2, Belgium 3, Bulgaria 4, Canada 7, China 3, Croatia 1, Cyprus 1, Czech Republic 2, Denmark 1, France 8, Germany 10, Greece 6, Hungary 1, India 2, Ireland 15, Israel 3, Italy 25, Japan 1, Korea 1, Pakistan 4, Portugal 1, Puerto Rico 1, Romania 1, Russia 12, Serbia 1, Singapore 1, Slovakia 4, Slovenia 1, Spain 13, Turkey 1, UK 22, USA 4, CERN 1, plus sites in FYROM, the Netherlands, Poland, Sweden, Switzerland and Taipei]
Grid Operations
• The grid is flat, but there is a hierarchy of responsibility – essential to scale the operation
[Diagram: Resource Centres (RC) grouped under Regional Operations Centres (ROC), with the Core Infrastructure Centres (CIC) and the Operations Management Centre (OMC) at the centre]
• Operations Management Centre (OMC)
– at CERN – coordination etc.
• Core Infrastructure Centres (CIC)
– act as a single operations centre (one centre on shift)
– daily grid operations – oversight, troubleshooting
– run essential infrastructure services
– provide 2nd-level support to the ROCs
– UK/I, Fr, It, CERN, + Russia + Taipei
• Regional Operations Centres (ROC)
– front-line support for user and operations problems
– provide local knowledge and adaptations
– one in each region – many distributed
• User Support Centre (GGUS)
– at FZK (Karlsruhe) (service desk)
LCG Service planning
• Pilot services – stable service from 1 June 2006
• LHC Service in operation – 1 October 2006: over the following six months, ramp up to full operational capacity & performance
• LHC service commissioned – 1 April 2007
[Timeline: 2006 – cosmics; 2007 – first physics; 2008 – full physics run]
Software layers
• Experiment (VO) specific applications (ALICE, ATLAS, CMS, LHCb)
• LHC common applications (ROOT, POOL, SEAL)
• Experiment specific high-level services (PhEDEx, DIRAC, AliEn, …)
• Grid middleware
• OS & network services
The LHC Computing Hierarchical Model
• Tier-0 at CERN
– Record RAW data (1.25 GB/s ALICE)
– Distribute second copy to Tier-1s
– Calibrate and do first-pass reconstruction
• Tier-1 centres (11 defined)
– Manage permanent storage – RAW, simulated, processed
– Capacity for reprocessing, bulk analysis
• Tier-2 centres (>~ 100 identified)
– Monte Carlo event simulation
– End-user analysis
• Tier-3
– Facilities at universities and laboratories
– Access to data and processing in Tier-2s, Tier-1s
– Outside the scope of the project
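To get a feel for the data movement this model implies, here is a minimal back-of-the-envelope sketch. The 1.25 GB/s ALICE heavy-ion RAW rate is quoted above and the 10^6 s/year heavy-ion live time comes from the Eventflow slide; the assumption that the Tier-0 export bandwidth roughly matches the recording rate simply follows from the requirement to distribute a second copy of RAW to the Tier-1s.

```python
# Back-of-the-envelope: Tier-0 recording vs. export load during ALICE heavy-ion running.
# Numbers are the ones quoted on the slides; treat this as an illustration, not an
# official figure from the computing model.
raw_rate_gb_s = 1.25          # ALICE heavy-ion RAW recording rate at Tier-0 [GB/s]
hi_live_time_s = 1e6          # heavy-ion running time per year [s]

raw_volume_pb = raw_rate_gb_s * hi_live_time_s / 1e6   # GB -> PB
print(f"RAW recorded per heavy-ion run: ~{raw_volume_pb:.2f} PB")

# Distributing a second copy of RAW to the Tier-1s means the Tier-0 must export data
# at roughly the same average rate as it records it.
export_rate_gb_s = raw_rate_gb_s
print(f"Average export rate to Tier-1s: ~{export_rate_gb_s:.2f} GB/s "
      f"(~{export_rate_gb_s * 8:.0f} Gbit/s of sustained WAN capacity)")
```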
Tier-1s
Tier-1 Centre – experiments served with priority:
• TRIUMF, Canada – ATLAS
• GridKA, Germany – ALICE, ATLAS, CMS, LHCb
• CC-IN2P3, France – ALICE, ATLAS, CMS, LHCb
• CNAF, Italy – ALICE, ATLAS, CMS, LHCb
• SARA/NIKHEF, NL – ALICE, ATLAS, LHCb
• Nordic Data Grid Facility (NDGF) – ALICE, ATLAS
• ASCC, Taipei – ATLAS, CMS
• RAL, UK – ALICE, ATLAS, CMS, LHCb
• BNL, US – ATLAS
• FNAL, US – CMS
• PIC, Spain – ATLAS, CMS, LHCb
Middleware: Baseline services
• In June 2005 the set of baseline services was agreed:
– Basic set of middleware required from the grid infrastructures
– Agreed by all experiments (with minor variations)
– The baseline services group, and later workshops, documented the missing features; the defined priorities are followed by the development plans in LCG and OSG
• gLite-3.0 (released in May for SC4) contains all of the baseline
services
– Still reliability, performance, management issues to be
addressed
• gLite-3.0 is an evolution of the previous LCG-2.7 and gLite-1.x
middleware
– Deployed in production without disturbing production
environment
– Forms the basis for evolution of the services to add missing
features, improve performance and reliability
Architecture – Grid services
• Storage Element
– Mass Storage System (MSS) (CASTOR, Enstore, HPSS, dCache, etc.)
– Storage Resource Manager (SRM) provides a common way to access
MSS, independent of implementation
– File Transfer Services (FTS) provided e.g. by GridFTP or srmCopy
• Computing Element
– Interface to local batch system e.g. Globus gatekeeper.
– Accounting, status query, job monitoring
• Virtual Organization Management
– Virtual Organization Management Services (VOMS)
– Authentication and authorization based on VOMS model.
• Grid Catalogue Services
– Mapping of Globally Unique Identifiers (GUID) to local file name
– Hierarchical namespace, access control
• Interoperability
– EGEE and OSG both use the Virtual Data Toolkit (VDT)
– Different implementations are hidden by common interfaces
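To make the interplay of these services concrete, here is a hedged sketch of how a user-level tool of that era would chain them together using the VOMS and lcg-utils command-line clients that shipped with LCG/gLite. The VO and the logical file name below are invented for illustration.

```python
# Minimal sketch: authenticate with VOMS, resolve a logical file name through the grid
# catalogue, and copy one replica locally through the SRM/GridFTP layer of the chosen
# Storage Element. The LFN and VO are made-up examples, not real datasets.
import subprocess

VO = "cms"                                    # example VO
LFN = "lfn:/grid/cms/user/example/data.root"  # hypothetical logical file name

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# 1. Authenticate: obtain a short-lived proxy with VOMS attributes for the VO.
run(["voms-proxy-init", "--voms", VO])

# 2. Catalogue lookup: resolve the logical name to its physical replicas (SURLs).
run(["lcg-lr", "--vo", VO, LFN])

# 3. Data access: copy one replica to local disk; lcg-cp talks to the SRM/GridFTP
#    interface of whatever mass-storage system sits behind the Storage Element.
run(["lcg-cp", "--vo", VO, LFN, "file:///tmp/data.root"])
```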
Grid user interface
Huge user community with different levels of expertise. What is required?
At the individual user level:
– An easy way to prepare and submit jobs (ideally changing nothing with respect to interactive submission), follow their progress, and find out the reasons for possible failures (ideally without looking in the log files), with error recovery
– Easy access to data samples located at remote storage, without knowledge of the physical data location and access protocol
At the VO management level:
– An easy way to get an overall picture of the experiment's activities on the world-wide distributed computing system, with a clear indication of possible problems of any nature
Job submission tools
• Job submission tools should hide from the user the complexity of dealing with the distributed computing facility, providing a simple and user-friendly interface
• A typical tool:
– talks to the experiment data management system to find out where the data is and how to split the user task
– implements task splitting
– does the packaging of the user code and libraries
– generates the executable shell script and the Grid submission instructions
– submits all jobs belonging to a task
– checks the status of the jobs belonging to a given task and retrieves the job output
– performs error recovery
• The experiments develop different VO-specific solutions. Examples: GANGA for ATLAS and LHCb, CRAB and ASAP for CMS
[Diagram: user → job submission tool (user interface) → WMS → computing sites, with VOMS (proxy), the experiment data management system, the file catalogue, the information system, the logging and bookkeeping service and the experiment software as supporting services]
A minimal sketch of such a tool is given below.
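As an illustration of the steps listed above, here is a minimal, hypothetical sketch of what such a tool does: split a task into jobs, generate a JDL description per job, and hand each to the LCG-2/gLite workload management client. The executable, dataset splitting and the choice of `edg-job-submit` as the submission command are assumptions for this example, not a description of any specific experiment tool.

```python
# Hypothetical sketch of a VO job-submission tool: split a task into N jobs, write one
# JDL file per job and submit each with the LCG-2/gLite command-line client.
# All names (script, sandbox files, VO) are illustrative only.
import subprocess

JDL_TEMPLATE = """\
Executable          = "run_analysis.sh";
Arguments           = "--first-event {first} --num-events {count}";
StdOutput           = "job_{index}.out";
StdError            = "job_{index}.err";
InputSandbox        = {{"run_analysis.sh", "user_code.tar.gz"}};
OutputSandbox       = {{"job_{index}.out", "job_{index}.err", "histos_{index}.root"}};
VirtualOrganisation = "cms";
"""

def split_task(total_events, events_per_job):
    """Task splitting: return (first_event, n_events) for each job."""
    return [(first, min(events_per_job, total_events - first))
            for first in range(0, total_events, events_per_job)]

def submit_task(total_events=100000, events_per_job=10000):
    for index, (first, count) in enumerate(split_task(total_events, events_per_job)):
        jdl_name = f"job_{index}.jdl"
        with open(jdl_name, "w") as jdl:
            jdl.write(JDL_TEMPLATE.format(first=first, count=count, index=index))
        # Hand the job to the workload management system; status would later be checked
        # with edg-job-status and the output retrieved with edg-job-get-output.
        subprocess.check_call(["edg-job-submit", "--vo", "cms", jdl_name])

if __name__ == "__main__":
    submit_task()
```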
Ganga
• Job submission
• Archiving
• Automatic monitoring
Ganga and user community
• Since July 2005 there have been three Ganga tutorials for LHCb and two for ATLAS, in various locations:
– CERN, September 2005
– Cambridge, January 2006
– Bologna, June 2006
These attract users who wouldn't otherwise be considering the Grid.
• Ganga has been tried out by 100+ people, with positive feedback:
– “Very handy way to organise job submission” (ATLAS user)
– “Clever and nicely designed” (LHCb user)
• A small but growing group of people regularly use Ganga (also from a laptop)
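Ganga is itself a Python tool, so job definition and submission happen from a Python prompt. The snippet below is a minimal sketch in the style of the Ganga public interface of that period; the executable and backend choice are illustrative, and exact options may differ between Ganga versions.

```python
# Minimal Ganga-style session (typed inside the ganga interpreter, where Job, Executable,
# LCG and the jobs registry are predefined): define a trivial job, send it to the LCG
# backend and follow it.
j = Job()                                   # a new job object, kept in Ganga's repository
j.application = Executable(exe='/bin/echo', args=['Hello from the Grid'])
j.backend = LCG()                           # submit through the LCG/gLite workload system
j.submit()                                  # Ganga generates the JDL and submits the job

jobs                                        # at the prompt, lists all archived jobs
print(j.status)                             # automatic monitoring keeps this up to date
j.peek('stdout')                            # inspect the output once the job completes
```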
CRAB
CRAB (the CMS Remote Analysis Builder) runs on the user interface (UI) and automates the analysis workflow:
1) Data location
2) Job preparation
3) Job submission
4) Job status
5) Job output retrieval
[Diagram: the CRAB client on the UI queries RefDB (DBS) and PubDB (DLS) for the data location, keeps its own database, and submits jobs through the Resource Broker (RB) to the computing elements (CE) and worker nodes (WN) at the sites hosting the data; file access at the site goes through the local file catalogue and storage element (SE), and log files / job output are retrieved back to the user]
• Widely used by the CMS physics community
• The plot shows CMS analysis jobs submitted via CRAB to LCG/OSG for the period 01.06.06-20.07.06, distributed by site: ~83K jobs, ~50 users, over 85 sites
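For completeness, this is roughly how a CMS user drove CRAB from the UI: write a small configuration file and run the create/submit/status/getoutput cycle. The section and parameter names follow the CRAB configuration of that era only approximately and the dataset, configuration file and scheduler are invented for illustration; treat this as a hedged sketch, not a reference.

```python
# Hypothetical sketch of a CRAB session: write a minimal crab.cfg and run the standard
# command cycle. Parameter names and flags are approximate; all values are invented.
import subprocess

CRAB_CFG = """\
[CRAB]
jobtype   = cmssw
scheduler = glite

[CMSSW]
datasetpath            = /ExampleDataset/Reco/RECO
pset                   = my_analysis.cfg
total_number_of_events = 100000
events_per_job         = 10000

[USER]
return_data = 1
"""

with open("crab.cfg", "w") as cfg:
    cfg.write(CRAB_CFG)

for step in (["-create"], ["-submit"], ["-status"], ["-getoutput"]):
    subprocess.check_call(["crab"] + step)   # CRAB splits, packages and submits the jobs
```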
ASAP work and information flow
The ASAP Task Manager minimizes the user's effort for task processing. It acts on the user's behalf (retrieving the user credentials from a MyProxy server):
– generates the JDL and submits the user jobs to gLite/LCG
– checks the job status (job monitoring)
– resubmits jobs in case of failure
– fetches the results and stores them at the final location (Castor)
– generates web monitoring pages
Through the ASAP UI the user registers the task with the Task Manager, delegates credentials via MyProxy, and specifies: application and application version, executable, data sample, working directory, Castor directory to save the output, number of events to be processed, number of events per job, data cards for the ORCA application, and the output file locations.
[Diagram: ASAP UI → Task Manager → gLite/LCG job submission → job running on the worker node, with MonALISA, RefDB and PubDB as monitoring and data-location services]
This functionality should be integrated into the final CMS analysis system.
Example of user task monitoring with ASAP
The job was resubmitted by the Task Manager multiple times due to Grid failures; it is currently running and has processed 715 events.
Experiment dashboard
The Experiment Dashboard should provide a single entry point to the monitoring information collected from the experiment's distributed computing system.
[Screenshots of CMS Dashboard views: interactive job monitoring, PhEDEx monitoring, transfer test monitoring, Frontier Squid monitoring, and the “Coffee” view]
Experiment dashboard
• Originally proposed by the CMS experiment (currently in production); the ATLAS dashboard follows the same concept and is currently in active development
• Originally started from job-related monitoring; now also covering other experiment activities such as Data Management, Transfer Tests, etc.
• Uses several sources of information: R-GMA, the Imperial College RB database, MonALISA, BDII
• Combines Grid-related data and VO-specific information
• Covers different middleware platforms (LCG, OSG, NorduGrid)
• In the future the Experiment Dashboard should play a more active role, not only collecting and displaying monitoring information but also analysing it and sending alarms in case of evident problems
Job monitoring use cases
• Quantities (how many jobs per site, per user, per submission tool, per data collection, …), distribution over time
• Usage of the resources (CPU, memory, I/O rates) aggregated at different levels
• Sharing of the resources between production and analysis, different analysis groups, individual users
• Indication of problems of any nature:
– how the Grid is behaving: success rate, failure reasons… (pure Grid troubles – lack of stability, scalability; site configuration; …)
– how the CMS application is behaving: success rate, performance, failure reasons… (data publishing, data access, software distribution, …)
• How CMS data is distributed over the sites
• …
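A dashboard answering these questions is essentially an aggregation over per-job monitoring records. The toy sketch below (with an invented record format, not the actual Dashboard schema) shows the kind of grouping involved: jobs per site, and Grid-level versus application-level success rates per site.

```python
# Toy aggregation over job monitoring records, grouped per site, in the spirit of the
# Dashboard use cases above. The record fields are invented for illustration; the real
# Dashboard collects its data from R-GMA, the RB database, MonALISA, etc.
from collections import defaultdict

job_records = [
    {"site": "T2_Example_A", "user": "alice", "grid_status": "Done",    "app_exit_code": 0},
    {"site": "T2_Example_A", "user": "bob",   "grid_status": "Done",    "app_exit_code": 1},
    {"site": "T2_Example_B", "user": "alice", "grid_status": "Aborted", "app_exit_code": None},
]

per_site = defaultdict(lambda: {"jobs": 0, "grid_ok": 0, "app_ok": 0})
for rec in job_records:
    stats = per_site[rec["site"]]
    stats["jobs"] += 1
    stats["grid_ok"] += rec["grid_status"] == "Done"     # Grid-level success
    stats["app_ok"] += rec["app_exit_code"] == 0          # application-level success

for site, s in sorted(per_site.items()):
    print(f"{site}: {s['jobs']} jobs, "
          f"grid success {100 * s['grid_ok'] / s['jobs']:.0f}%, "
          f"application success {100 * s['app_ok'] / s['jobs']:.0f}%")
```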
Tracing troubles with the Dashboard
• High Grid failure rate at a site → site misconfiguration
• High application failure rate at a site → might be a problem with data access or software distribution
• The CMS JobRobot sends test analysis jobs to all CMS production sites
Tracing troubles with the Dashboard
• Distribution of jobs by Resource Broker: resource broker rb106.cern.ch clearly has a problem – all its jobs stay pending without being allocated to the computing elements
Data Management Monitoring
• ATLAS Data Management Monitoring
Conclusions
• The LHC Grid computing service relies on the international grid infrastructures, which are operational
• Operation of the pilot service and the experience accumulated by the LHC experiments made it possible to define the experiment computing models and to reach an agreement between the LHC experiments on the baseline services
• Data taking starts next year; the main focus is on the reliability, scalability and performance of the overall system and of the individual services
Backup slide: What is EGEE?
EGEE is the largest Grid infrastructure project in Europe:
• 70 leading institutions in 27 countries, federated in regional Grids
• Leveraging national and regional grid activities
• Started April 2004 (runs to March 2006)
• EU review, February 2005: successful
• Preparing the 2nd phase of the project – proposal to the EU Grid call of September 2005 – 2 years starting April 2006
• Promoting scientific partnership outside the EU
Goal of EGEE: develop a service grid infrastructure which is available to scientists 24 hours a day.
LCG and EGEE are different projects, but collaboration is ensured (sharing instead of duplication).
Backup slide: Tier-0 -1 -2 Connectivity
National Research Networks (NRENs) at the Tier-1s: ASnet, LHCnet/ESnet, GARR, LHCnet/ESnet, RENATER, DFN, SURFnet6, NORDUnet, RedIRIS, UKERNA, CANARIE
Backup slide: Tier-0 components
• Batch system (LSF) to manage CPU resources
• Shared file system (AFS)
• Disk pool and mass storage (MSS) manager (CASTOR)
• Extremely Large Fabric management system (ELFms)
– Quattor – system administration – installation and configuration
– LHC Era Monitoring (LEMON) system, server/client based
– LHC-Era Automated Fabric (LEAF) – high-level commands to sets of nodes
• CPU servers – ‘white boxes’, Intel processors, (Scientific) Linux
• Disk storage – Network Attached Storage (NAS) – mostly mirrored
• Tape storage – currently STK robots – future system under evaluation
• Network – fast gigabit Ethernet switches connected to multi-gigabit backbone routers
Data Challenges
• ALICE
– PDC04 using AliEn services, native or interfaced to the LCG Grid: 400,000 jobs run, producing 40 TB of data for the Physics Performance Report
– PDC05: event simulation, first-pass reconstruction, transmission to Tier-1 sites, second-pass reconstruction (calibration and storage), analysis with PROOF – using Grid services from LCG SC3 and AliEn
• ATLAS
– Using tools and resources from LCG, NorduGrid and Grid3 at 133 sites in 30 countries, with over 10,000 processors; 235,000 jobs produced more than 30 TB of data using an automatic production system
• CMS
– 100 TB of simulated data reconstructed at a rate of 25 Hz, distributed to the Tier-1 sites and reprocessed there
• LHCb
– LCG provided more than 50% of the capacity for the first data challenge 2004-2005; the production used the DIRAC system
Service Challenges
• A series of Service Challenges (SC) set out to successively approach the production needs of the LHC
• While SC1 did not meet the goal of transferring continuously for 2 weeks at a rate of 500 MB/s, SC2 exceeded that goal (500 MB/s) by sustaining a throughput of 600 MB/s to 7 sites
• SC3 starts soon, using gLite middleware components, with disk-to-disk throughput tests, 10 Gb networking of Tier-1s to CERN, and an SRM (1.1) interface to managed storage at the Tier-1s. The goal is to achieve 150 MB/s disk-to-disk and 60 MB/s to managed tape. There will also be Tier-1 to Tier-2 transfer tests.
• SC4 aims to demonstrate that all requirements, from raw data taking to analysis, can be met at least 6 months prior to data taking. The aggregate rate out of CERN is required to be 1.6 GB/s to tape at the Tier-1s.
• The Service Challenges will turn into production services for the experiments.
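To put the SC4 target in perspective, a quick calculation, assuming the aggregate export is spread roughly evenly over the 11 Tier-1s (a simplification; the real shares follow the experiments' computing models):

```python
# Rough split of the SC4 export target over the Tier-1s, and the daily volume it implies.
aggregate_rate_gb_s = 1.6      # required aggregate rate out of CERN to tape at Tier-1s
n_tier1 = 11                   # Tier-1 centres defined at the time

per_tier1_mb_s = aggregate_rate_gb_s * 1000 / n_tier1
daily_volume_tb = aggregate_rate_gb_s * 86400 / 1000

print(f"~{per_tier1_mb_s:.0f} MB/s per Tier-1 on average "
      f"(comparable to the 150 MB/s disk and 60 MB/s tape targets of SC3)")
print(f"~{daily_volume_tb:.0f} TB exported from CERN per day at the full SC4 rate")
```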
Backup slide: Key dates for Service Preparation
• Sep 05 – SC3 service phase
• May 06 – SC4 service phase
• Sep 06 – initial LHC Service in stable operation
• Apr 07 – LHC Service commissioned
[Timeline 2005-2008: SC3 → SC4 → LHC Service Operation, leading through cosmics and first beams / first physics to the full physics run in 2008]
• SC3 – reliable base service – most Tier-1s, some Tier-2s – basic experiment software chain – grid data throughput 1 GB/sec, including mass storage 500 MB/sec (150 MB/sec & 60 MB/sec at Tier-1s)
• SC4 – all Tier-1s, major Tier-2s – capable of supporting the full experiment software chain incl. analysis – sustain nominal final grid data throughput (~1.5 GB/sec mass storage throughput)
• LHC Service in Operation – September 2006 – ramp up to full operational capacity by April 2007 – capable of handling twice the nominal data throughput