U.S. ATLAS Testbed Status Report


Experience with ATLAS Data Challenge Production on the U.S. Grid Testbed
Kaushik De
University of Texas at Arlington
CHEP03
March 27, 2003
The ATLAS Experiment
• Multi-purpose experiment at the Large Hadron Collider, CERN
• 14 TeV c.m. pp collisions starting in 2007
• Physics: Higgs, SUSY, new searches...
• Petabytes/year of data analyzed by >2000 physicists worldwide - need the GRID
U.S. ATLAS Grid Testbed
• BNL - U.S. Tier 1, 2000 nodes, 5% ATLAS, 10 TB
• LBNL - pdsf cluster, 400 nodes, 5% ATLAS, 1 TB
• Boston U. - prototype Tier 2, 64 nodes
• Indiana U. - prototype Tier 2, 32 nodes
• UT Arlington - 20 nodes
• U. Michigan - 10 nodes
• Oklahoma U. - 12 nodes
• SMU - 6 nodes
• ANL - test nodes
• UNM - new site
U.S. Testbed Goals
• Deployment
  - Set up grid infrastructure and ATLAS software
  - Test installation procedures (PACMAN)
• Development & Testing
  - Grid applications - GRAT, Grappa, Magda...
  - Other software - monitoring, packaging...
• Run Production
  - For U.S. physics data analysis and tests
• Main focus - ATLAS Data Challenges
  - Simulation, pileup
  - Reconstruction
• Connection to GRID projects
  - GriPhyN - Globus, Condor, Chimera… use & test
  - iVDGL - VDT, glue schema testbed, WorldGrid testbed, demos… use and test
  - EDG, LCG… testing & deployment
ATLAS Data Challenges
DCs - generate and analyse simulated data (see talk by Gilbert Poulard on Tuesday)
• Original Goals (Nov 15, 2001)
  - Test the computing model, its software, and its data model, and ensure the correctness of the technical choices to be made
  - Data Challenges should be executed at the prototype Tier centres
  - Data Challenges will be used as input for a Computing Technical Design Report due by the end of 2003 (?) and for preparing a MoU
• Current Status
  - Goals are evolving as we gain experience
  - Sequence of increasing scale & complexity
  - DC0 (completed), DC1 (underway)
  - DC2, DC3, and DC4 planned
  - Grid deployment and testing are a major part of the DCs
GRAT Software
• GRid Applications Toolkit
• Used for U.S. Data Challenge production
• Based on Globus, Magda & MySQL
• Shell & Python scripts, modular design
• Rapid development platform - quickly develop packages as needed by the DCs:
  - Single particle production
  - Higgs & SUSY production
  - Pileup production & data management
  - Reconstruction
• Test grid middleware, test grid performance
• Modules can easily be enhanced or replaced by Condor-G, the EDG resource broker, Chimera, a replica catalogue, OGSA… (in progress; see the sketch below)
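To make the modular design concrete, here is a minimal Python sketch. It is not the actual GRAT code; all names are hypothetical. The point is only that each stage of the pipeline is a pluggable callable, so a Globus-based submission module can be swapped for a Condor-G one without touching the rest:

```python
# Minimal sketch of a pluggable pipeline stage (hypothetical names).
# Swapping back ends means rebinding one entry in the PIPELINE table.

def submit_globus(site: str, job_script: str) -> None:
    """Placeholder for submission through the Globus gatekeeper."""
    print(f"globus: submitting {job_script} to {site}")

def submit_condor_g(site: str, job_script: str) -> None:
    """Placeholder for submission through Condor-G instead."""
    print(f"condor-g: submitting {job_script} to {site}")

PIPELINE = {"submit": submit_globus}   # swap in submit_condor_g here

PIPELINE["submit"]("atlas.uta.edu", "dc1_simul.sh")
```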
GRAT Execution Model
[Diagram: GRAT execution flow between the production host (UTA), Magda (BNL), the parameter database (CERN), a local replica catalogue, DC1 scratch space, the remote gatekeeper, and batch execution; the numbered arrows correspond to the steps below.]
1. Resource Discovery
2. Partition Selection
3. Job Creation
4. Pre-stage
5. Batch Submission
6. Job Parameterization
7. Simulation
8. Post-stage
9. Cataloging
10. Monitoring
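Read as code, the loop might be condensed as below. This is a hypothetical Python rendering of the ten steps, with every helper reduced to a stub so the control flow runs end to end:

```python
# Hypothetical condensation of the GRAT execution model; each call
# corresponds to one numbered step in the diagram above.

def step(n: int, name: str) -> None:
    print(f"step {n:2d}: {name}")

def produce_partition() -> None:
    step(1, "resource discovery")    # find a site with free CPUs
    step(2, "partition selection")   # pick an unprocessed partition from the production DB
    step(3, "job creation")          # build the batch job script
    step(4, "pre-stage")             # copy input files into DC1 scratch at the site
    step(5, "batch submission")      # submit through the remote gatekeeper
    step(6, "job parameterization")  # fetch job parameters (e.g. from CERN)
    step(7, "simulation")            # the simulation runs in the local batch system
    step(8, "post-stage")            # move outputs out of scratch
    step(9, "cataloging")            # register output files in Magda
    step(10, "monitoring")           # update job status in the production DB

produce_partition()
```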
Middleware Evolution of U.S. Applications
[Figure: middleware components grouped by adoption status - used in current production software (GRAT & Grappa); tested successfully (not yet used for large scale production); under development and testing; tested for simulation (will be used for large scale reconstruction).]
Databases used in GRAT
• MySQL databases are central to GRAT
• Production database (sketched below)
  - defines logical job parameters & filenames
  - tracks job status, updated periodically by scripts
• Data management (Magda)
  - file registration/catalogue
  - grid based file transfers
• Virtual Data Catalogue
  - simulation job definition
  - job parameters, random numbers
• Metadata catalogue (AMI)
  - post-production summary information
  - data provenance
• A similar scheme is being considered ATLAS-wide by the Grid Technical Board
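As an illustration of the production database idea, the sketch below defines a logical job and updates its status the way a periodic script would. GRAT used MySQL; sqlite3 is used here only so the example runs standalone, and the table layout and filenames are hypothetical:

```python
import sqlite3

# Hypothetical production-database sketch: jobs are logical partitions,
# and status scripts update rows as work progresses.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE partitions (
    name    TEXT PRIMARY KEY,  -- logical partition name
    infile  TEXT,              -- logical input filename
    outfile TEXT,              -- logical output filename
    site    TEXT,              -- site the job was dispatched to
    status  TEXT               -- e.g. defined / submitted / done / failed
)""")

# Define a logical job...
db.execute("INSERT INTO partitions VALUES "
           "('dc1.simul.0001', 'evgen.0001.root', 'simul.0001.zebra', "
           "NULL, 'defined')")
# ...then update it the way a status-tracking script would.
db.execute("UPDATE partitions SET site=?, status=? WHERE name=?",
           ("atlas.uta.edu", "submitted", "dc1.simul.0001"))

print(db.execute("SELECT * FROM partitions").fetchall())
```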
DC1 Production on U.S. Grid
• August/September 2002
  - 3-week DC1 production run using GRAT
  - Generated 200,000 events using ~1,300 CPU days; 2,000 files, 100 GB of storage at 4 sites
• December 2002
  - Generated 75k SUSY and Higgs events for DC1
  - Total DC1 files generated and stored >500 GB; total CPU used >1,000 CPU days in 4 weeks
• January 2003
  - More SUSY samples
  - Started pile-up production on the grid, both high and low luminosity, for 1-2 months at all sites
• February/March 2003
  - Discovered a bug in the software (non-grid part)
  - Regenerating all SUSY, Higgs & pile-up samples
  - ~15 TB of data, 15k files, 2M events, 10k CPU days
DC1 Production Examples
Each production run requires development & deployment of new software at selected sites.
[Two plots. Top: "DC simulation Aug/Sep 2002" - number of jobs per date, mid-August through early October 2002, broken down by site (UTA, OU, LBL). Bottom: "High Luminosity pileup Feb/Mar 2003" - output files per date, 17 Feb through 5 Mar 2003, broken down by site (UTA, BU, LBNL, BNL).]
DC1 Production Experience
• The grid paradigm works, using Globus
  - Opportunistic use of existing resources; run anywhere, from anywhere, by anyone...
• Successfully exercised grid middleware with increasingly complex tasks
  - Simulation: create physics data from pre-defined parameters and input files; CPU intensive
  - Pile-up: mix ~2500 min-bias data files into physics simulation files; data intensive
  - Reconstruction: data intensive, multiple passes
  - Data tracking: multiple steps, with one -> many -> many more mappings (see the sketch at the end of this list)
• Tested grid applications developed in the U.S.
  - For example, PACMAN (Saul Youssef - BU)
  - Magda (see talk by Wensheng Deng)
  - Virtual Data Catalogue (see poster by P. Nevski)
  - GRAT (this talk), GRAPPA (see talk by D. Engh)
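The data-tracking point deserves a concrete picture. Below is a minimal, hypothetical Python sketch of the one -> many -> many-more bookkeeping (the real mappings lived in Magda and the production database): one generator file fans out into several simulation partitions, each of which fans out again when mixed with min-bias files during pile-up:

```python
# Hypothetical provenance map: output file -> input files it came from.
provenance = {}

def derive(outputs, inputs):
    for out in outputs:
        provenance[out] = list(inputs)

# one evgen file -> many simulation partitions
derive([f"simul.{i:04d}.zebra" for i in range(1, 4)], ["evgen.0001.root"])
# one simulation partition + many min-bias files -> a pile-up file
derive(["pileup.0001.zebra"],
       ["simul.0001.zebra"] + [f"minbias.{i:04d}.zebra" for i in range(1, 6)])

def ancestors(f):
    """Walk the mapping back to the original inputs."""
    parents = provenance.get(f, [])
    return set(parents) | {a for p in parents for a in ancestors(p)}

print(sorted(ancestors("pileup.0001.zebra")))
```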
Grid Quality of Service
• Anything that can go wrong WILL go wrong
  - During 18 days of grid production (in August), every system died at least once
  - Local experts were not always accessible
  - Examples: the scheduling machines died 5 times (thrice power failure, twice system hung); network outages occurred multiple times; the gatekeeper died at every site at least 2-3 times
  - Three databases were used - production, Magda, and virtual data. Each died at least once!
  - Scheduled maintenance - HPSS, the Magda server, LBNL hardware, the LBNL RAID array…
  - Poor cleanup, lack of fault tolerance in Globus
• These outages should be expected on the grid - software design must be robust (see the retry sketch below)
• We managed >100 files/day (~80% efficiency) in spite of these problems!
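As one example of the robustness such scripts needed, here is a minimal retry wrapper in Python. It is a sketch of the general technique, not GRAT's actual fault handling; the wrapped operation and timings are placeholders:

```python
import time

def with_retries(action, attempts=3, wait=60):
    """Retry a flaky grid operation (transfer, DB update, submission)
    instead of letting one outage kill the production run."""
    for n in range(1, attempts + 1):
        try:
            return action()
        except Exception as err:
            print(f"attempt {n} failed: {err}")
            if n == attempts:
                raise          # give up; mark the partition failed
            time.sleep(wait)   # back off, e.g. while a gatekeeper restarts

# usage: wrap any step that talks to a remote service, e.g.
# with_retries(lambda: stage_file("gsiftp://host/path", "/scratch/dc1/"))
```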
Conclusion
• The largest (>10 TB) grid-based production in ATLAS was done by the U.S. testbed
• Grid production is possible, but not easy right now - the middleware needs hardening, and higher level services are needed
• Many tools are still missing - monitoring, an operations center, data management
• This requires an iterative learning process, with rapid evolution of the software design
• Pile-up was a major data management challenge on the grid - moving >0.5 TB/day
• Successful so far
• Continuously learning and improving
• Many more DCs coming up!