Moving the LHCb Monte Carlo production system to the GRID


LHCb Distributed Computing
and the Grid
Nick Brook
University of Bristol
D. Galli, U. Marconi, V. Vagnoni (INFN Bologna)
N. Brook (Bristol)
K. Harrison (Cambridge)
E. Van Herwijnen, J. Closier, P. Mato (CERN)
A. Khan (Edinburgh)
A. Tsaregorodtsev (Marseille)
H. Bulten, S. Klous (NIKHEF)
F. Harris, I. McArthur, A. Soroko (Oxford)
G. N. Patrick, G. Kuznetsov (RAL)

ACAT'02, 27th June 2002
Overview of presentation
• Current organisation of LHCb distributed
computing
• UK facilities and support through GridPP
• Current use of Globus and EDG middleware
• Planning for data challenges and the use of Grid
• Current LHCb Grid/applications R&D
• Conclusions
History of distributed MC production
• Distributed System has been running for 3+ years &
processed many millions of events for LHCb design.
• Main production sites:
– CERN, Bologna, Liverpool, Lyon, NIKHEF & RAL
• Globus already used for job submission to RAL and
Lyon
• System interfaced to the Grid and demonstrated at the EU DataGrid review and the NeSC/UK opening.
• For 2002 Data Challenges, adding new institutes:
– Bristol, Cambridge, Oxford, ScotGrid
• In 2003, add
– Barcelona, Moscow, Germany, Switzerland & Poland.
Current Architecture
[Diagram: current production architecture]
• Actors: Physics Coordinator, Production Manager, Physicist; a central Bookkeeping Database.
• Job creation/submission via Web: identify outstanding requests, select the workflow, determine the configuration, create the jobs (500 events each), generate scripts via Java servlets, submit jobs to the distributed sites.
• Each job: run the executable, check the data, copy data/logs.
• Monitoring via PVSS: see what jobs are running, check the configuration, kill jobs, etc.
LOGICAL FLOW
Submit jobs remotely via Web → Execute on farm → Data quality check → Transfer data to mass store → Update bookkeeping database → Analysis
Monitoring and Control
of MC jobs
• LHCb has adopted PVSS II as prototype
control and monitoring system for MC
production.
– PVSS is a commercial SCADA (Supervisory
Control And Data Acquisition) product
developed by ETM.
– Adopted as Control framework for LHC
Joint Controls Project (JCOP).
– Available for Linux and Windows platforms.
UK Tier 1 - RAL
New computing farm:
• 4 racks holding 156 dual 1.4 GHz Pentium III CPUs.
• Each box has 1 GB of memory, a 40 GB internal disk and 100 Mb/s ethernet.
• PCs are clustered on network switches with up to 8 × 1000 Mb/s ethernet out of each rack.
Tape robot:
• Upgraded last year; uses 60 GB STK 9940 tapes.
• 45 TB current capacity; could hold 330 TB.
Disk store:
• 50 TByte disk-based mass storage unit after RAID 5 overhead.
2004 scale: 1000 CPUs, 0.5 PBytes.
UK Regional Centres
[Diagram: UK regional centres]
• Local perspective: consolidate research computing.
• Optimisation of the number of nodes?
• Relative size dependent on funding dynamics.
UK Prototype Tier2 - ScotGrid
ScotGrid processing nodes at Glasgow:
• 59 IBM X Series 330, dual 1 GHz Pentium III with 2 GB memory
• 2 IBM X Series 340, dual 1 GHz Pentium III with 2 GB memory and dual ethernet
• 3 IBM X Series 340, dual 1 GHz Pentium III with 2 GB memory and 100 + 1000 Mbit/s ethernet
• 1 TB disk
• LTO/Ultrium tape library
• Cisco ethernet switches
ScotGrid storage at Edinburgh:
• IBM X Series 370, PIII Xeon, 32 × 512 MB RAM
• 70 × 73.4 GB IBM FC hot-swap HDDs
2004 scale: 300 CPUs, 0.1 PBytes.
GridPP support
• 2 LHCb posts:
– to work on Gaudi (software framework) persistency services
– to work on MC monitoring and control software
• 2 ATLAS/LHCb (Gaudi/GANGA) posts:
– interface between the software framework and Grid services
Current Use of Grid Middleware
in development system
• Authentication
– grid-proxy-init
• Job submission to DataGrid
– dg-job-submit
• Monitoring and control
– dg-job-status
– dg-job-cancel
– dg-job-get-output
• Data publication and replication
– globus-url-copy, GDMP
• Resource scheduling – use of CERN MSS
– JDL, sandboxes, storage elements
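
A minimal sketch of how these commands chain together for a single job is shown below, driving the command-line tools from Python; the JDL file name, output directory and job-identifier parsing are illustrative assumptions, not the actual production scripts.

import subprocess

def run(cmd):
    # Run a grid command, fail loudly on error, return its captured output
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Authentication: create a Grid proxy from the user certificate (interactive)
subprocess.run(["grid-proxy-init"], check=True)

# Job submission to the DataGrid (JDL file as in Example 1 below)
out = run(["dg-job-submit", "bbincl1600061.jdl", "-o", "logsub/"])
# The broker reports a job identifier (an https URL) in its output
job_id = next(line.strip() for line in out.splitlines()
              if line.strip().startswith("https://"))

# Monitoring and control
print(run(["dg-job-status", job_id]))
# run(["dg-job-cancel", job_id])          # kill the job if necessary

# Retrieve the output sandbox once the job has finished
run(["dg-job-get-output", job_id])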
Example 1:
Job Submission
dg-job-submit /home/evh/sicb/sicb/bbincl1600061.jdl -o /home/evh/logsub/
bbincl1600061.jdl:
#
Executable = "script_prod";
Arguments = "1600061,v235r4dst,v233r2";
StdOutput = "file1600061.output";
StdError = "file1600061.err";
InputSandbox = {"/home/evhtbed/scripts/x509up_u149", "/home/evhtbed/sicb/mcsend", "/home/evhtbed/sicb/fsize", "/home/evhtbed/sicb/cdispose.class", "/home/evhtbed/v235r4dst.tar.gz", "/home/evhtbed/sicb/sicb/bbincl1600061.sh", "/home/evhtbed/script_prod", "/home/evhtbed/sicb/sicb1600061.dat", "/home/evhtbed/sicb/sicb1600062.dat", "/home/evhtbed/sicb/sicb1600063.dat", "/home/evhtbed/v233r2.tar.gz"};
OutputSandbox = {"job1600061.txt", "D1600063", "file1600061.output", "file1600061.err", "job1600062.txt", "job1600063.txt"};
Example 2:
Data Publishing & Replication
[Diagram: data publishing and replication]
• CERN testbed: the job runs on a Compute Element and writes its data to local disk.
• The data are copied to the Storage Element / MSS with globus-url-copy.
• The file is registered locally (register-local-file) and published to the Replica Catalogue.
• Rest of Grid (NIKHEF, Amsterdam): a job uses replica-get to fetch the data onto its own Storage Element.
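
A sketch of the copy-and-register step as it might look from the worker node, again driving the command-line tools from Python. The storage-element URL and file paths are invented for illustration, and the gdmp_* command names and options are assumptions standing in for the "register-local-file" and "publish" steps of the diagram.

import subprocess

def run(cmd):
    # Run a command-line tool and stop if it fails
    subprocess.run(cmd, check=True)

local_file = "/data/job1600061/dst1.dat"                        # produced by the job (illustrative)
se_url = "gsiftp://testbed-se.cern.ch/flatfiles/lhcb/dst1.dat"  # Storage Element destination (illustrative)

# Copy the output from local disk to the Storage Element
run(["globus-url-copy", "file://" + local_file, se_url])

# Register the new file with GDMP and publish the updated catalogue so that
# the Replica Catalogue learns about it (command names/options assumed)
run(["gdmp_register_local_file", "-d", "/flatfiles/lhcb"])
run(["gdmp_publish_catalogue"])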
LHCb Data Challenge 1
(July-September 2002)
• Physics Data Challenge (PDC) for detector, physics
and trigger evaluations
– based on existing MC production system – small amount
of Grid tech to start with
– Generate ~3×10^7 events (signal + specific background + generic b and c + min bias)
• Computing Data Challenge (CDC) for checking
developing software
– will make more extensive use of Grid middleware
• Components will be incorporated into PDC once
proven in CDC
LHCb software framework - Gaudi
[Diagram: Gaudi architecture]
• Application Manager steering a sequence of Algorithms.
• Common services: Message Service, JobOptions Service, Particle Properties Service, other services.
• Event Data Service with a Transient Event Store; Detector Data Service with a Transient Detector Store; Histogram Service with a Transient Histogram Store.
• Each transient store is connected to the data files through a Persistency Service and its Converters.
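
To make the separation in the diagram concrete: algorithms never open the data files themselves; they read and write objects in transient stores, which a persistency service (via its converters) fills on demand. The toy Python sketch below only illustrates that pattern; the real Gaudi framework is C++ and all names here are invented.

class TransientEventStore:
    # Toy in-memory store: algorithms see only this, never the data files
    def __init__(self, persistency_service):
        self._persistency = persistency_service
        self._objects = {}

    def get(self, path):
        # Load the object on demand through the persistency service/converters
        if path not in self._objects:
            self._objects[path] = self._persistency.read(path)
        return self._objects[path]

    def put(self, path, obj):
        self._objects[path] = obj

class Algorithm:
    # Toy base class: the Application Manager calls execute() once per event
    def __init__(self, name, event_store):
        self.name = name
        self.event_store = event_store

    def execute(self):
        raise NotImplementedError

class ApplicationManager:
    # Toy steering: run the configured algorithms over the requested events
    def __init__(self, algorithms, n_events):
        self.algorithms = algorithms
        self.n_events = n_events

    def run(self):
        for _ in range(self.n_events):
            for alg in self.algorithms:
                alg.execute()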
GANGA: Gaudi ANd Grid Alliance
Joint Atlas (C. Tull) and LHCb (P. Mato) project,
formally supported by GridPP/UK with 2 joint
Atlas/LHCb research posts at Cambridge and Oxford
• Application allowing end-user physicists and production managers to use Grid services for running Gaudi/Athena jobs.
• A GUI-based application that should help throughout the complete job lifetime:
– job preparation and configuration
– resource booking
– job submission
– job monitoring and control
[Diagram: the GUI drives GANGA, which configures the GAUDI program (job options, algorithms), books resources and submits jobs through collective Grid services, and collects monitoring information, histograms and results.]
Required functionality
• Before Gaudi/Athena program starts
– Security (obtaining certificates and credentials)
– Job configuration (algorithm configuration, input data
selection, ...)
– Resource booking and policy checking (CPU, storage,
network)
– Installation of required software components
– Job preparation and submission
• While Gaudi/Athena program is running:
– Job monitoring (generic and specific)
– Job control (suspend, abort, ...)
• After program has finished:
– Data management (registration)
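
A hypothetical Python skeleton of how these phases could map onto a GANGA-style job object; every class and method name below is invented for illustration (GANGA itself was still at the design stage).

class GridJob:
    # Hypothetical wrapper covering the phases listed above
    def __init__(self, application, job_options, input_data):
        self.application = application    # e.g. "Gaudi" or "Athena"
        self.job_options = job_options    # algorithm configuration
        self.input_data = input_data      # input data selection

    def authenticate(self):
        """Obtain Grid certificates and credentials."""

    def book_resources(self, cpu_hours, storage_gb):
        """Check policies and reserve CPU, storage and network."""

    def install_software(self):
        """Install the required software components at the target site."""

    def submit(self):
        """Prepare the sandboxes and submit the job."""

    def monitor(self):
        """Generic and application-specific monitoring while running."""

    def control(self, action):
        """Suspend, abort, ... the running job."""

    def register_output(self):
        """Register the produced data after the program has finished."""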
Python Bus Design
(A possible model for implementation)
[Diagram]
• Local users interact through a GUI, remote users through an HTML page over the Internet.
• Modules plugged into the Python software bus: GaudiPython, Java module, OS module, EDG API, PythonROOT.
• Through the bus the GAUDI client reaches the Athena/GAUDI program, the Grid, and the Workspaces, Bookkeeping, Job Configuration and Production databases.
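
A toy illustration of the software-bus idea: each external component (the EDG middleware, the bookkeeping database, GAUDI itself, ...) is wrapped in a small Python module or class, and the running interpreter is the "bus" on which a GUI, a web front-end or an interactive user composes them. All names below are invented.

import subprocess

class EdgModule:
    # Wrapper around the EDG command-line tools (cf. Example 1)
    def submit(self, jdl_file):
        out = subprocess.run(["dg-job-submit", jdl_file],
                             check=True, capture_output=True, text=True).stdout
        return out  # contains the job identifier

class BookkeepingModule:
    # Stand-in for the bookkeeping database client
    def record(self, job_info):
        print("would record in the bookkeeping DB:", job_info)

# The "bus" is simply the Python interpreter: GUI, HTML front-end and
# command-line users all drive the same objects.
edg = EdgModule()
bookkeeping = BookkeepingModule()
job = edg.submit("bbincl1600061.jdl")
bookkeeping.record(job)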
Conclusions
• LHCb already has distributed MC production using GRID
facilities for job submission
• We are embarking on large scale data challenges commencing
July 2002, and we are developing our analysis model
• Grid middleware will be progressively integrated into our production
environment as it matures (starting with EDG, and looking forward to GLUE)
• R&D projects are in place
– for interfacing users (production + analysis) and Gaudi/Athena
software framework to Grid services
– for putting production system into integrated Grid environment
with monitoring and control
• All work being conducted in close participation with EDG and
LCG projects
– Ongoing evaluations of EDG middleware with physics jobs
– Participate in LCG working groups e.g. Report on ‘Common use
cases for a HEP Common Application layer’
http://cern.ch/fca/HEPCAL.doc