U.S. ATLAS Testbed Status Report
Experience with ATLAS
Data Challenge Production
on the U.S. Grid Testbed
Kaushik De
University of Texas at Arlington
CHEP03
March 27, 2003
The ATLAS Experiment
•Multi-purpose experiment at the Large Hadron Collider, CERN
•14 TeV c.m. pp collisions starting in 2007
•Physics: Higgs, SUSY, new searches...
•Petabytes/year of data analyzed by >2000 physicists worldwide - need the GRID
U.S. ATLAS Grid Testbed
•BNL - U.S. Tier 1, 2000 nodes, 5% ATLAS, 10 TB
•LBNL - pdsf cluster, 400 nodes, 5% ATLAS, 1 TB
•Boston U. - prototype Tier 2, 64 nodes
•Indiana U. - prototype Tier 2, 32 nodes
•UT Arlington - 20 nodes
•Oklahoma U. - 12 nodes
•U. Michigan - 10 nodes
•SMU - 6 nodes
•ANL - test nodes
•UNM - new site
U.S. Testbed Goals
•Deployment
  Set up grid infrastructure and ATLAS software
  Test installation procedures (PACMAN)
•Development & Testing
  Grid applications - GRAT, Grappa, Magda...
  Other software - monitoring, packaging...
•Run Production
  For U.S. physics data analysis and tests
  Main focus - ATLAS Data Challenges
  Simulation, pile-up, reconstruction
•Connection to GRID projects
  GriPhyN - Globus, Condor, Chimera... use & test
  iVDGL - VDT, glue schema testbed, Worldgrid testbed, demos... use and test
  EDG, LCG... testing & deployment
ATLAS Data Challenges
•DCs - generate and analyse simulated data (see talk by Gilbert Poulard on Tuesday)
•Original Goals (Nov 15, 2001)
  Test the computing model, its software, and its data model, and ensure the correctness of the technical choices to be made
  Data Challenges should be executed at the prototype Tier centres
  Data Challenges will be used as input for a Computing Technical Design Report due by the end of 2003 (?) and for preparing a MoU
•Current Status
  Goals are evolving as we gain experience
  Sequence of increasing scale & complexity
  DC0 (completed), DC1 (underway)
  DC2, DC3, and DC4 planned
  Grid deployment and testing are a major part of the DCs
GRAT Software
•GRid Applications Toolkit
•Used for U.S. Data Challenge production
•Based on Globus, Magda & MySQL
•Shell & Python scripts, modular design
•Rapid development platform
  Quickly develop packages as needed by DC
  Single particle production
  Higgs & SUSY production
  Pile-up production & data management
  Reconstruction
  Test grid middleware, test grid performance
•Modules can be easily enhanced or replaced by Condor-G, EDG resource broker, Chimera, replica catalogue, OGSA... (in progress)
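The modular design above can be illustrated with a short Python sketch. This is not actual GRAT code - the function names and job structure are invented - but it shows the design point: each stage is a plain callable, so the submission backend can later be swapped for Condor-G or an EDG resource broker without touching the rest of the pipeline.

```python
# Hypothetical sketch of GRAT's modular pipeline (invented names, not real
# GRAT code): the submit stage is pluggable, so a Globus-based backend can
# be replaced by e.g. a Condor-G backend without changing the other stages.

def globus_submit(job):
    """Stand-in for submission through a Globus gatekeeper."""
    return f"submitted {job['name']} to {job['site']} via globus"

def condorg_submit(job):
    """Stand-in for a drop-in Condor-G replacement backend."""
    return f"submitted {job['name']} to {job['site']} via condor-g"

def run_pipeline(job, submit=globus_submit):
    """Run the fixed stages; only the submit stage is swapped out."""
    steps = []
    steps.append(f"pre-staged input for {job['name']}")     # pre-stage
    steps.append(submit(job))                               # pluggable backend
    steps.append(f"catalogued output of {job['name']}")     # cataloguing
    return steps

job = {"name": "dc1.sim.0001", "site": "UTA"}
print(run_pipeline(job)[1])                          # default Globus backend
print(run_pipeline(job, submit=condorg_submit)[1])   # swapped backend
```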
GRAT Execution Model
[Diagram: production database (UTA), Magda (BNL), parameter database (CERN), local replica catalogue, DC1 scratch area, remote gatekeeper and batch execution, linked by the numbered steps below]
1. Resource Discovery
2. Partition Selection
3. Job Creation
4. Pre-stage
5. Batch Submission
6. Job Parameterization
7. Simulation
8. Post-stage
9. Cataloging
10. Monitoring
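The ten steps above can be sketched as a simple driver loop. This is a simulation of the control flow only (all names invented); the real GRAT scripts call Globus, Magda and MySQL where this sketch just appends to a log and updates a dict.

```python
# Simulated walk through the ten GRAT execution steps (hypothetical names;
# real GRAT contacts Globus, Magda and MySQL where this updates local state).

magda_catalog = {}   # stands in for the Magda replica catalogue at BNL

def run_job(partition, site):
    log = []
    log.append(f"1. discovered site {site}")               # resource discovery
    log.append(f"2. selected partition {partition}")       # from prod. DB (UTA)
    log.append(f"3. created job for {partition}")          # job creation
    log.append("4. pre-staged input to DC1 scratch")       # pre-stage
    log.append(f"5. submitted to {site} gatekeeper")       # batch submission
    log.append("6. fetched job parameters")                # param DB (CERN)
    log.append("7. ran simulation in batch")               # simulation
    log.append("8. post-staged output from scratch")       # post-stage
    magda_catalog[partition] = f"{site}:/dc1/{partition}.zebra"
    log.append("9. catalogued output in Magda")            # cataloging
    log.append("10. monitored job status")                 # monitoring
    return log

log = run_job("dc1.002000.simul.0042", "UTA")
print(len(log), "steps;", magda_catalog)
```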
Middleware Evolution of U.S. Applications
[Figure: middleware components grouped by status - used in current production software (GRAT & Grappa); tested successfully (not yet used for large scale production); under development and testing; tested for simulation (will be used for large scale reconstruction)]
Databases used in GRAT
•MySQL databases central to GRAT
•Production database
  define logical job parameters & filenames
  track job status, updated periodically by scripts
•Data management (Magda)
  file registration/catalogue
  grid based file transfers
•Virtual Data Catalogue
  simulation job definition
  job parameters, random numbers
•Metadata catalogue (AMI)
  post-production summary information
  data provenance
•Similar scheme being considered ATLAS-wide by the Grid Technical Board
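The production database role described above can be shown with a minimal sketch, using SQLite in place of the MySQL server GRAT actually used; the table and column names here are invented for illustration. Logical job parameters and filenames are defined up front, and scripts periodically update each job's status.

```python
import sqlite3

# Minimal stand-in for GRAT's MySQL production database (schema invented):
# logical jobs are defined before any site runs them, and production scripts
# update each job's status as it moves through the grid.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE jobs (
    part    TEXT PRIMARY KEY,   -- logical partition name
    outfile TEXT,               -- logical output filename
    status  TEXT                -- e.g. defined/submitted/done/failed
)""")

# Define logical job parameters & filenames up front.
con.executemany("INSERT INTO jobs VALUES (?, ?, 'defined')", [
    (f"dc1.002000.{i:04d}", f"dc1.002000.{i:04d}.zebra") for i in range(3)
])

# A production script claims a partition and later records completion.
con.execute("UPDATE jobs SET status='submitted' WHERE part='dc1.002000.0000'")
con.execute("UPDATE jobs SET status='done' WHERE part='dc1.002000.0000'")

done = con.execute("SELECT COUNT(*) FROM jobs WHERE status='done'").fetchone()[0]
print(done)  # 1
```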
DC1 Production on U.S. Grid
•August/September 2002
  3 week DC1 production run using GRAT
  Generated 200,000 events, using ~1,300 CPU days, 2,000 files, 100 GB storage at 4 sites
•December 2002
  Generated 75k SUSY and Higgs events for DC1
  Total DC1 files generated and stored >500 GB, total CPU used >1,000 CPU days in 4 weeks
•January 2003
  More SUSY samples
  Started pile-up production on the grid, both high and low luminosity, for 1-2 months at all sites
•February/March 2003
  Discovered bug in software (non-grid part)
  Regenerating all SUSY, Higgs & pile-up samples
  ~15 TB data, 15k files, 2M events, 10k CPU days
DC1 Production Examples
Each production run requires development & deployment of new software at selected sites
[Chart: number of jobs per day by site (UTA, OU, LBL), DC simulation, Aug/Sep 2002]
[Chart: output files per day by site (UTA, BU, LBNL, BNL), high luminosity pile-up, Feb/Mar 2003]
DC1 Production Experience
•Grid paradigm works, using Globus
  Opportunistic use of existing resources, run anywhere, from anywhere, by anyone...
•Successfully exercised grid middleware with increasingly complex tasks
  Simulation: create physics data from pre-defined parameters and input files, CPU intensive
  Pile-up: mix ~2500 min-bias data files into physics simulation files, data intensive
  Reconstruction: data intensive, multiple passes
  Data tracking: multiple steps, one -> many -> many more mappings
•Tested grid applications developed in the U.S.
  For example, PACMAN (Saul Youssef - BU)
  Magda (see talk by Wensheng Deng)
  Virtual Data Catalogue (see poster by P. Nevski)
  GRAT (this talk), GRAPPA (see talk by D. Engh)
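The one -> many -> many more data tracking mentioned above can be sketched as a provenance map. All filenames here are invented, and the fan-out factors are shrunk (real DC1 pile-up mixed ~2500 min-bias files per job): one generator file fans out to several simulation files, and each simulation file fans out again when mixed with min-bias data.

```python
# Sketch of multi-step provenance tracking (invented filenames): child file
# -> list of parent files, queried recursively to recover original inputs.
provenance = {}

def record(child, parents):
    provenance[child] = list(parents)

# Step 1: one generator file -> several simulation files.
for i in range(3):
    record(f"simul.{i}.zebra", ["evgen.0001.root"])

# Step 2: each simulation file + min-bias files -> a pile-up file.
minbias = [f"minbias.{j}.zebra" for j in range(4)]   # ~2500 in real DC1
for i in range(3):
    record(f"pileup.{i}.zebra", [f"simul.{i}.zebra"] + minbias)

def ancestors(f):
    """Walk the provenance map back to the original inputs."""
    out = set()
    for p in provenance.get(f, []):
        out.add(p)
        out |= ancestors(p)
    return out

print(len(ancestors("pileup.0.zebra")))  # 6: 1 simul + 4 min-bias + 1 evgen
```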
Grid Quality of Service
•Anything that can go wrong, WILL go wrong
  During 18 days of grid production (in August), every system died at least once
  Local experts were not always accessible
  Examples: scheduling machines died 5 times (thrice power failure, twice system hung), network outages multiple times, gatekeeper died at every site at least 2-3 times
  Three databases used - production, Magda and virtual data. Each died at least once!
  Scheduled maintenance - HPSS, Magda server, LBNL hardware, LBNL RAID array...
  Poor cleanup, lack of fault tolerance in Globus
•These outages should be expected on the grid - software design must be robust
•We managed >100 files/day (~80% efficiency) in spite of these problems!
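One robustness pattern implied by the outages above is retry-with-backoff around every grid operation. The sketch below is generic (not the testbed's actual code): any flaky call - a gatekeeper submission, a file transfer, a database update - is wrapped so a transient failure does not kill the production run.

```python
import time

def with_retries(op, attempts=3, base_delay=0.01):
    """Retry a flaky operation with exponential backoff; re-raise only
    after the final attempt fails."""
    for n in range(1, attempts + 1):
        try:
            return op()
        except OSError:
            if n == attempts:
                raise
            time.sleep(base_delay * 2 ** (n - 1))  # back off before retrying

# Simulated flaky transfer: fails twice, then succeeds on the third try.
calls = {"n": 0}
def flaky_transfer():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("gatekeeper died")
    return "transferred"

print(with_retries(flaky_transfer))  # transferred
```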
Conclusion
•The largest (>10 TB) grid-based production in ATLAS was done by the U.S. testbed
•Grid production is possible, but not easy right now - need to harden middleware, need higher level services
•Many tools are missing - monitoring, operations center, data management
•Requires an iterative learning process, with rapid evolution of software design
•Pile-up was a major data management challenge on the grid - moving >0.5 TB/day
•Successful so far
  Continuously learning and improving
  Many more DCs coming up!