CMS HLT production using
Grid tools
Flavia Donno (INFN Pisa)
Claudio Grandi (INFN Bologna)
Ivano Lippi (INFN Padova)
Francesco Prelz (INFN Milano)
Andrea Sciabà (INFN Pisa)
Massimo Sgaravatto (INFN Padova)
Zhen Xie (INFN Pisa)
Introduction
Goals
Evaluate the existing GRID technologies with real
applications and in real production environments
Can these GRID tools be useful to “manage”
these HEP applications?
Collaboration between:
CMS
INFN-GRID WP 1 (Installation and Evaluation of
the Globus toolkit) http://www.infn.it/globus
DataGrid WP 1 (Grid Workload Management)
M. Sgaravatto - INFN Padova
Applications
[Figure: CMS production data flow. Labels recoverable from the diagram: MC Prod. (Signal, MB); HEPEVT ntuples; CMSIM; Zebra files with HITS; ORCA Prod.; ooHit Formatter; ORCA Digitization (merge signal and MB); Objectivity Database; Catalog import; Mirrored Db’s; HLT Grp Databases; HLT Algorithms; New Reconstructed Objects]
Tested configuration for CMS production
[Figure: tested configuration. Labels recoverable from the diagram: Production manager; Submit jobs; condor_submit (Globus Universe); Condor-G; Globus GRAM (two instances); Local Resource Management Systems (CONDOR, LSF); CMS Farms; sites Padova, Bologna, Pisa]
Condor-G as reliable, crash-proof submitting service
GRAM as uniform interface to different resource management
systems
Overview
PC farms at each site installed and configured
using the CMS farm kickstart toolkit
PC farms managed by possibly different local
resource management systems
Globus GRAM as uniform interface to the
different local resource management systems
Globus deployment using the INFNGRID
distribution toolkit (see Zhen’s presentation)
considering the INFN setup
Overview
Condor-G as reliable, crash-proof submitting service
Job submission and monitoring by the production
manager from a single machine
The production manager decides on which Globus
resource (farm) the job must be executed
Executable and input files stored on the executing
farm
Output files created on the executing machine
Log files created on the submitting machine
Authentication using Globus GSI (use of certificates
signed by INFN CA)
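The submission step described above can be sketched as a small helper that writes a Condor-G submit description for one job. This is a minimal sketch, not the production scripts used in these tests: the function name, host name, paths, and job name are hypothetical, and the submit commands (`universe = globus`, `globusscheduler`, `transfer_executable`) are assumed from Condor-G submit syntax of this period and should be checked against the manual of the installed Condor version.

```python
def build_submit_description(gatekeeper, jobmanager, executable, arguments, job_name):
    """Return the text of a submit description file for condor_submit.

    gatekeeper and jobmanager together form the GRAM contact of the farm
    chosen by the production manager (assumed form: host/jobmanager-<lrms>).
    """
    lines = [
        "universe = globus",
        f"globusscheduler = {gatekeeper}/jobmanager-{jobmanager}",
        f"executable = {executable}",
        # The executable is already stored on the executing farm,
        # so it is not shipped from the submitting machine.
        "transfer_executable = false",
        f"arguments = {arguments}",
        # The job log is written on the submitting machine.
        f"log = {job_name}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: one job sent to an LSF-managed farm.
description = build_submit_description(
    gatekeeper="gatekeeper.example.org",
    jobmanager="lsf",
    executable="/opt/cms/bin/run_cmsim.sh",
    arguments="run42",
    job_name="cmsim_run42",
)
print(description)
```

The resulting text would be saved to a file and passed to condor_submit on the submitting machine. Changing the `jobmanager` argument is the only difference between targeting a farm managed by LSF and one managed by Condor, which is the point of using GRAM as the uniform interface.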
Results
The CMS production using Globus and Condor-G
failed
Many memory leaks found in the Globus
jobmanager!
... but we (Francesco Prelz, INFN Milano) have been able to
provide fixes for these bugs
Fixes reported to Globus team
Feedback received only for the bugs in the GAA and
GSS modules (new fixes “merged” with the original ones)
Work in progress
Tests with these fixes
Fixes included in the INFN-GRID distribution
Other problems
Globus GRAM
Some minor bugs found and fixed (fixes included
in the INFN-GRID distribution)
Necessary to “address” some “major” problems
Scalability (one jobmanager for each job)
Reliability (the jobmanager is not persistent)
…
Condor-G
Some problems in the current implementation (it’s
a prototype)
Scalability in the submitting machine
Logging
…
Next steps
New tests considering the next CMS productions with the
“patched” Globus jobmanager
New tests with the new implementations of Condor-G and
Globus jobmanager (by Condor team)
Tests with bypass
Tool written by D. Thain (Condor team) that allows redirection of
standard input/output/error to a remote machine (the submitting
machine) while the program is running (split execution system)
Use of GSI authentication mechanisms
New implementation resilient to several kinds of failures
Tests with the first WP 1 prototype
“Integration” with software provided by the other WPs (e.g.
replica management tools, ...)
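The idea behind the bypass tests listed above, seeing a job's standard output while it is still running, can be illustrated with a short sketch. This is not how Bypass itself works and does not use any Bypass API: it only forwards each output line of a child process to a callback as soon as it is produced. The function name `run_and_forward` and the `sink` callback are hypothetical; in a real setup the sink would write to a connection towards the submitting machine.

```python
import subprocess
import sys


def run_and_forward(command, sink):
    """Run command and pass each line of its stdout/stderr to sink as it appears."""
    proc = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge standard error into standard output
        text=True,
    )
    for line in proc.stdout:
        sink(line)  # forwarded while the program is still running
    return proc.wait()


# Hypothetical example: the sink just collects the lines in memory.
collected = []
return_code = run_and_forward(
    [sys.executable, "-c", "print('event 1'); print('event 2')"],
    collected.append,
)
print(return_code, collected)
```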
Prototype workload management system
architecture
[Figure: prototype architecture. Labels recoverable from the diagram: Submit jobs; Master; Resource Discovery (using Class-Ads); Grid Information Service (GIS); Other info; condor_submit (Globus Universe); Condor-G; Globus GRAM (three instances); Local Resource Management Systems (CONDOR, LSF, PBS); Farms; Site1, Site2, Site3]
Master chooses to which Globus resources the jobs
must be submitted
Grid Information Service (GIS): information on characteristics and
status of local resources
Condor-G able to provide a reliable/crash-proof job
submission service
Globus GRAM as uniform interface to different local resource
management systems
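The Master's choice of a Globus resource can be sketched as a simple matchmaking step: filter the resources by the job's requirements, then rank the survivors. This is a minimal sketch in plain Python, not the ClassAd language or the WP 1 prototype code. The attribute names (`free_cpus`, `has_cms_software`), the farm names, and the ranking rule are all hypothetical stand-ins for the information the Grid Information Service would publish.

```python
def choose_resource(job, resources):
    """Return the best resource satisfying the job's requirements, or None."""
    candidates = [r for r in resources if job["requirements"](r)]
    if not candidates:
        return None
    # Among the matching resources, prefer the one with the highest rank.
    return max(candidates, key=job["rank"])


# Hypothetical resource descriptions, as they might be read from the GIS.
resources = [
    {"name": "farm_a", "lrms": "CONDOR", "free_cpus": 4, "has_cms_software": True},
    {"name": "farm_b", "lrms": "LSF", "free_cpus": 10, "has_cms_software": True},
    {"name": "farm_c", "lrms": "PBS", "free_cpus": 20, "has_cms_software": False},
]

# Hypothetical job description: needs the CMS software and a free CPU,
# and prefers the farm with the most free CPUs.
job = {
    "requirements": lambda r: r["has_cms_software"] and r["free_cpus"] > 0,
    "rank": lambda r: r["free_cpus"],
}

chosen = choose_resource(job, resources)
print(chosen["name"])
```

Here farm_c has the most free CPUs but fails the requirements, so the job goes to farm_b. The chosen resource would then be handed to Condor-G for submission through that farm's GRAM interface.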