Moving the LHCb Monte Carlo production system to the GRID


D. Galli, U. Marconi, V. Vagnoni (INFN Bologna); N. Brook (Bristol);
E. van Herwijnen, P. Mato (CERN); A. Khan (Edinburgh);
M. McCubbin, G. D. Patel (Liverpool); A. Tsaregorodtsev (Marseille);
H. Bulten, S. Klous (Nikhef); F. Harris (Oxford);
G. N. Patrick, R. A. Sansum (RAL)

Presented by F. Harris, CHEP, Beijing, 3 September 2001

Overview of presentation
• Functionality and distribution of the current system
• Experience with the use of Globus in tests and production
• Requirements and planning for the use of DataGrid middleware and security system
• Planning for interfacing the GAUDI software framework to GRID services
• Conclusions

LHCb distributed computing environment
(15 countries: 13 European + Brazil and China; 50 institutes)
• Tier-0
  – CERN
• Tier-1
  – RAL (UK), IN2P3 (Lyon), INFN (Bologna), Nikhef, CERN + ?
• Tier-2
  – Liverpool, Edinburgh/Glasgow, Switzerland + ? (maybe grow to ~10)
• Tier-3
  – ~50 throughout the collaboration
• Ongoing negotiations for centres (Tier-1/2/3)
  – Germany, Russia, Poland, Spain, Brazil
• Current GRID involvement
  – DataGrid (and national GRID efforts in the UK, Italy, ...)
  – Active in WP8 (HEP Applications) of DataGrid
  – Will use middleware (WP1-5) + Testbed (WP6) + Network (WP7) + security tools

Current MC production facilities

Centre      OS     Max. # (av. #) of CPUs    Batch    Typical weekly          Percentage submitted
                   used simultaneously       system   production (k events)   through Grid
CERN        Linux  315 (60)                  LSF       85                     10%
RAL         Linux   50 (30)                  PBS       35                     100%
IN2P3       Linux  225 (60)                  BQS       35                     100%
Liverpool   Linux  300 (250)                 Custom   150                     0%
Bologna     Linux   20 (20)                  PBS       35                     0%

• The max. # of CPUs used simultaneously is usually less than the capacity of the farm.
• Will soon extend to Nikhef, Edinburgh, Bristol.

Distributed MC production, today

[Workflow diagram] The current production chain (a code sketch follows below):
• Submit jobs remotely via the Web
• Execute on the farm
• Monitor the performance of the farm via the Web
• Transfer the data to the CASTOR mass store at CERN
• Update the bookkeeping database (Oracle at CERN)
• Data quality check on the data stored at CERN
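
As a rough illustration of how the chain above is driven, the sketch below strings the steps together in Python; the script names, CASTOR path and bookkeeping call are hypothetical stand-ins, not the actual LHCb production scripts (only rfcp, the CASTOR remote-file-copy client, is a real command).

```python
# Hypothetical sketch of the current production chain; names and paths are
# illustrative stand-ins for the real LHCb production scripts.
import subprocess

def run_simulation(card_file: str, output_file: str) -> None:
    # On the farm this step is the batch job running the MC executable.
    print(f"simulate {card_file} -> {output_file}")

def copy_to_castor(local_file: str, castor_path: str) -> None:
    # rfcp is the CASTOR remote-file-copy client used at CERN.
    subprocess.run(["rfcp", local_file, castor_path], check=True)

def update_bookkeeping(dataset: str, castor_path: str) -> None:
    # In production this is an update of the Oracle bookkeeping DB at CERN.
    print(f"register {dataset}: {castor_path}")

if __name__ == "__main__":
    run_simulation("bb_inclusive.cards", "evts_0001.dst")
    copy_to_castor("evts_0001.dst", "/castor/cern.ch/lhcb/mc/evts_0001.dst")
    update_bookkeeping("bb-inclusive", "/castor/cern.ch/lhcb/mc/evts_0001.dst")
```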

Distributed MC production in future (using DataGRID middleware)

[Workflow diagram] The same chain, mapped onto DataGrid work-package tools:
• Submit jobs remotely via the Web (WP1 job submission tools, WP4 environment)
• Execute on the farm (WP1 job submission tools)
• Transfer the data to CASTOR (and HPSS, RAL Datastore) (WP2 data replication, WP5 API for mass storage)
• Update the bookkeeping database (WP2 metadata tools, WP1 tools)
• Monitor the performance of the farm via the Web (WP3 monitoring tools)
• Data quality check 'online' (online histogram production using GRID pipes)

Use of Globus in tests and production
• Use of Globus simplifies remote production
  – Submit jobs through local Globus commands rather than remote logon (see the sketch below)
• Some teething problems in tests (some due to the learning curve)
  – Some limitations of the system (e.g. need for large temporary space for running jobs)
  – Some mismatches between Globus and the PBS batch system (job parameters ignored; submitting >100 jobs gives problems)
• The DataGrid testbed organisation will ensure synchronisation of versions at sites + Globus support
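
A minimal sketch of what "local Globus commands rather than remote logon" looks like, assuming a gatekeeper in front of a PBS farm; the contact string, executable and RSL values are illustrative, and the queue attribute is exactly the kind of job parameter the slide notes could be ignored on the PBS side.

```python
# Sketch of remote submission through Globus GRAM (contact string, paths and
# RSL values are illustrative).
import subprocess

CONTACT = "farm-gatekeeper.example.org/jobmanager-pbs"  # hypothetical gatekeeper

rsl = (
    "&(executable=/lhcb/prod/run_mc.sh)"
    "(arguments=bb_inclusive.cards)"
    "(stdout=mc.log)(stderr=mc.err)"
    "(queue=long)"  # the kind of parameter sometimes ignored by the PBS jobmanager
)

# -b: submit in batch mode and return a job contact; -r: gatekeeper contact
subprocess.run(["globusrun", "-b", "-r", CONTACT, rsl], check=True)
```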

Security
• M9 (October 2001 ...)
  – Authorisation group working towards a tool providing single log-on and a single role per individual (see the sketch below)
  – Individuals will get certificates from their national CA
  – Must work out the administration for this at the start for the experiment VO; probably ~10 users for LHCb
• M21 (October 2002 ...)
  – Single log-on firmly in place; moved to a structured VO with (group, individual) authorisation and multiple roles
  – Maybe up to ~50 users
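
For context on the single log-on: with Globus GSI the user already holds a certificate issued by a national CA and creates a short-lived proxy once per session, which later grid commands use without prompting again. A minimal sketch with the standard GSI commands (the 12-hour lifetime is just an example):

```python
# Sketch of the single log-on step with standard Globus GSI commands.
import subprocess

# Show the subject of the user certificate issued by the national CA.
subprocess.run(["grid-cert-info", "-subject"], check=True)

# Create a 12-hour proxy; subsequent grid commands authenticate with this
# proxy instead of asking for the certificate pass phrase each time.
subprocess.run(["grid-proxy-init", "-hours", "12"], check=True)
```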

Job Submission
• M9
  – Use the command line interface to WP1 JDL; 'static' file specification (see the sketch below)
  – Use the environment specification as agreed with WP1 and WP4 (no cloning)
• M21
  – Interface to WP1 job options via the LHCb application (GANGA); dynamic 'file' environment according to application navigation
  – May require access to query language tools for the metadata
  – More comprehensive environment specification
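
A sketch of the M9 command-line route: a small JDL file carrying the static file specification, written and submitted from a script. The JDL attribute names follow the EDG ClassAd style; the file names are illustrative and the submit command name is our assumption for the Testbed1 WP1 user interface.

```python
# Sketch of M9-style job submission: write a WP1 JDL file with a static file
# list, then submit it (the dg-job-submit command name is an assumption).
from pathlib import Path
import subprocess

jdl = """\
Executable    = "run_mc.sh";
Arguments     = "bb_inclusive.cards";
StdOutput     = "mc.log";
StdError      = "mc.err";
InputSandbox  = {"run_mc.sh", "bb_inclusive.cards"};
OutputSandbox = {"mc.log", "mc.err"};
"""

Path("mcprod.jdl").write_text(jdl)
subprocess.run(["dg-job-submit", "mcprod.jdl"], check=True)
```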

Job Execution
• M9
  – Will run on farms at CERN, Lyon, RAL for first tests
    • Extend to Nikhef, Bologna, Edinburgh once we get stability
  – Will use a very simple environment (binaries)
  – 'Production' flavour for work
• M21
  – Should be running on many sites (~20?)
  – Complete LHCb environment for production and development, without AFS (use WP1 'sandboxes')
  – Should be testing user analysis via GRID, as well as performing production (~50)

Job Monitoring and data quality checking
• M9
  – Monitor farms with home-grown tools via the Web
  – Use home-grown data histogramming tools for data monitoring
• M21
  – Integrate WP3 tools for farm performance (status of jobs)
  – Combine LHCb ideas on state management and data quality checking with DataGrid software

Bookkeeping database
• M9
  – Use the current CERN-centric Oracle-based system
• M21
  – Move to WP2 metadata handling tools? (use of LDAP, Oracle?) (see the sketch below)
  – This will be distributed database handling using the facilities of the replica catalogue and replica management
  – LHCb must interface the applications' view (metadata) to GRID tools; availability of query tools?
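
If the M21 metadata handling were exposed through LDAP (one of the options mentioned above), a bookkeeping query from an application might look like the sketch below; the server, base DN, attribute names and filter are all hypothetical.

```python
# Illustrative LDAP query against a hypothetical bookkeeping directory
# (server, DNs and attribute names are invented for the example).
import ldap

con = ldap.initialize("ldap://bookkeeping.example.cern.ch")
con.simple_bind_s()  # anonymous bind

results = con.search_s(
    "ou=datasets,o=lhcb",                          # hypothetical base DN
    ldap.SCOPE_SUBTREE,
    "(&(datasetType=mc)(channel=bb-inclusive))",   # hypothetical attributes
    ["replicaLocation", "eventCount"],
)
for dn, attrs in results:
    print(dn, attrs)
```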

Data copying and mass storage handling
• M9
  – WP2 GDMP tool via a command line interface to transfer Zebra-format files (controlled from LHCb scripts); see the sketch below
  – WP5 interface to CASTOR
• M21
  – GDMP will be replaced by smaller tools with an API interface; copy Zebra + Root + ?
  – Tests of strategy-driven copying via the replica catalogue and replica management
  – WP5 interfaces to more mass storage devices (HPSS + RAL Datastore)
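
A sketch of the M9 copy step as driven from an LHCb script: the producing site registers and publishes the new Zebra files with GDMP, and a subscribed site pulls them into its mass store. The GDMP command names and flags are quoted from memory of the EDG GDMP documentation and should be treated as assumptions.

```python
# Sketch of GDMP-driven replication from an LHCb production script
# (command names and flags are assumptions; paths are illustrative).
import subprocess

NEW_FILES_DIR = "/data/mcprod/outgoing"   # directory with freshly written Zebra files

# Producing site: register the new files and publish the updated catalogue
# so that subscribed sites are notified.
subprocess.run(["gdmp_register_local_file", "-d", NEW_FILES_DIR], check=True)
subprocess.run(["gdmp_publish_catalogue"], check=True)

# Subscribed site (e.g. CERN): pull the newly published files, which then
# reach the local mass store (CASTOR) through its storage interface.
subprocess.run(["gdmp_replicate_get"], check=True)
```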

Gaudi Architecture

[Architecture diagram] The Application Manager steers the framework services (Message Service, JobOptions Service, Particle Properties Service and other services). Algorithms see event, detector and histogram data only through the Event Data, Detector Data and Histogram Services, each backed by a transient store; an Event Selector drives the event loop, and Converters together with Persistency Services connect each transient store to the data files. (An illustrative sketch of this pattern follows below.)
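
GAUDI itself is a C++ framework; the algorithm/transient-store pattern in the diagram can nevertheless be illustrated with a few lines of Python pseudo-framework. The class and method names below only mirror the pattern and are not the GAUDI API.

```python
# Illustration of the algorithm / service / transient-store pattern from the
# diagram (not the real GAUDI C++ API).
class TransientEventStore:
    """Holds event data for the current event only."""
    def __init__(self):
        self._objects = {}
    def put(self, path, obj):
        self._objects[path] = obj
    def get(self, path):
        return self._objects[path]

class Algorithm:
    """Base class: algorithms are handed service handles, never file access."""
    def __init__(self, name, event_svc):
        self.name = name
        self.event_svc = event_svc
    def execute(self):
        raise NotImplementedError

class CountTracks(Algorithm):
    def execute(self):
        # Algorithms see only the transient store; converters and the
        # persistency service fill it from the data files behind the scenes.
        tracks = self.event_svc.get("/Event/Tracks")
        print(f"{self.name}: {len(tracks)} tracks")

store = TransientEventStore()
store.put("/Event/Tracks", ["t1", "t2", "t3"])   # stand-in for converted data
CountTracks("CountTracks", store).execute()
```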

GAUDI services linking to external services

[Architecture diagram] The same GAUDI structure with framework services mapped onto external services: the JobOptions Service to Job and Configuration Services, the Message Service to a Monitoring Service, the Particle Properties Service to the PDG database, the Event Selector to a dataset database, the Persistency Services to mass storage, an event database and the OS file system, and the Histogram Service to a histogram presenter used by an analysis program.

Another View

[Diagram] Algorithms sit on top of the Gaudi services (the Gaudi domain), which in turn use application-external services (the Grid domain); each boundary is crossed through a well-defined API.

GANGA: Gaudi ANd Grid Alliance

[Diagram] GANGA is a GUI layer between the GAUDI program and the collective and resource Grid services: job options and algorithms go in, and histograms, monitoring information and results come back. (A rough sketch of the idea follows below.)
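
As a rough sketch of the GANGA idea (purely hypothetical code, since GANGA is only being defined): a thin layer that takes the GAUDI job options, turns them into a grid job description, hands it to the job submission service, and later collects monitoring information and results for display.

```python
# Purely hypothetical sketch of the GANGA idea: GAUDI job options in,
# grid job out, results back (none of these names are real interfaces).
from pathlib import Path
import subprocess

def submit_gaudi_job(job_options: str, executable: str) -> None:
    # Turn the GAUDI job options into a grid job description and submit it.
    jdl = (
        f'Executable    = "{executable}";\n'
        f'InputSandbox  = {{"{executable}", "{job_options}"}};\n'
        f'OutputSandbox = {{"histos.root", "job.log"}};\n'
    )
    Path("gaudi_job.jdl").write_text(jdl)
    subprocess.run(["dg-job-submit", "gaudi_job.jdl"], check=True)  # assumed WP1 UI

submit_gaudi_job("Brunel.opts", "run_brunel.sh")
```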

Conclusions
• LHCb already has distributed MC production using GRID facilities for job submission
• Will test the DataGrid M9 (Testbed1) deliverables in an incremental manner from October 15, using tools from WP1-5
• Have commenced defining projects to interface software framework (GAUDI) services (event persistency, event selection, job options) to GRID services
• Within the WP8 structure we will work closely with the other work packages (middleware, testbed, network) in a cycle of requirements analysis, design, implementation and testing
• http://lhcb-comp.web.cern.ch/lhcb-comp/
• http://datagrid-wp8.web.cern.ch/DataGrid-WP8/