ATLAS and Grid Computing
RWL Jones
GridPP 13, 5th July 2005
ATLAS Computing Timeline
2003
• POOL/SEAL release (done)
• ATLAS release 7 (with POOL persistency) (done)
• LCG-1 deployment (done)
2004
• ATLAS complete Geant4 validation (done)
• ATLAS release 8 (done)
• DC2 Phase 1: simulation production (done)
2005 (NOW)
• DC2 Phase 2: intensive reconstruction (the real challenge!) LATE!
• Combined test beams (barrel wedge) (done)
• Computing Model paper (done)
• Computing Memorandum of Understanding (done)
• ATLAS Computing TDR and LCG TDR (in progress)
2006
• Computing System Commissioning
• Physics Readiness Report
2007
• Start cosmic ray run
• GO!
(Commissioning takes priority!)
Computing TDR structure
• The TDR describes the whole Software & Computing Project as defined within the ATLAS organization:
  – Major activity areas within the S&C Project
  – Liaisons to other ATLAS projects
Massive productions on 3 Grids
Massive productions on 3 Grids (3)
• July-September 2004: DC2 Geant4 simulation (long jobs)
  – 40% on the LCG/EGEE Grid, 30% on Grid3 and 30% on NorduGrid
• February-May 2005: Rome production
  – 70% on the LCG/EGEE Grid, 25% on Grid3, 5% on NorduGrid
• LCG/EGEE Grid resources were always difficult to saturate by “traditional” means
  – A new approach (Lexor-CondorG) used Condor-G to submit directly to the sites (a submission sketch follows after this list)
    • in this way the job rate was doubled on the same total available resources
  – much more efficient usage of the CPU resources
  – the same approach is now being evaluated for Grid3/OSG job submission, which also suffered from job-rate problems
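As a rough illustration of the direct-submission idea (not the actual Lexor-CondorG code; the gatekeeper contact string, executable name and file names are placeholders), a Condor-G job can be handed straight to a site's Globus gatekeeper, bypassing the resource broker:

```python
# Sketch of direct Condor-G submission to a single site's gatekeeper.
# The gatekeeper contact string and the executable are placeholders,
# not real ATLAS production values.
import subprocess

submit_description = """\
universe      = grid
grid_resource = gt2 ce.example-site.org/jobmanager-lcgpbs
executable    = run_atlas_transform.sh
output        = job.out
error         = job.err
log           = job.log
queue
"""

with open("atlas_job.sub", "w") as f:
    f.write(submit_description)

# Hand the description to Condor-G (requires a local Condor installation
# and a valid Grid proxy for authentication at the remote gatekeeper).
subprocess.run(["condor_submit", "atlas_job.sub"], check=True)
```

Because each such job goes straight to a chosen site, the submission rate is no longer limited by the broker, which is the effect reported above.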
Massive productions on 3 Grids (4)
• 73 datasets containing 6.1M events simulated and reconstructed (without pile-up)
• Total simulated data: 8.5M events
• Pile-up done later (completed for 1.3M events as of last week)
[Pie chart: ATLAS Rome Production, number of jobs per Grid flavour — LCG 34%, LCG-CG 31%, Grid3 24%, NorduGrid 11%; 573,315 jobs in total, run at 84 sites in 22 countries]
Sites: uibk.ac.at, umontreal.ca, cern.ch, csvs.ch, skurut.cz, atlas.fzk.de, benedict.dk, morpheus.dk, ft.uam.es, marseille.fr, clrece.in2p3.fr, isabella.gr, cnaf.it, roma1.it, ba.it, lnf.it, to.it, ct.it, fe.it, roma2.it, pi.it, nikhef.nl, hypatia.no, lip.pt, hagrid.se, sigrid.se, chalmers.se, savka.sk, sinica.tw, shef.uk, ucl.uk, lancs.uk, ed.uk, BNL.us, UC_ATLAS.us, FNAL.us, OU.us, Hampton.us, UCSanDiego.us, SMU.us, ANL.us, UC.us, Unknown, triumf.ca, utoronto.ca, unibe.ch, golias.cz, gridka.fzk.de, lcg-gridka.fzk.de, nbi.dk, ific.uv.es, ifae.es, cclcgcdli.in2p3.fr, cea.fr, kfki.hu, lnl.it, mi.it, pd.it, na.it, fi.it, ca.it, bo.it, sara.nl, uio.no, zeus.pl, msu.ru, bluesmoke.se, pdc.se, brenta.si, ihep.su, ral.uk, ox.uk, ic.uk, man.uk, UTA.us, BU.us, PDSF.us, IU.us, PSU.us, UNM.us, UFlorida.us, CalTech.us, UWMadison.us, Rice.us
Experience with LCG-2 Operations
• Support for our productions from the CERN IT-EIS team was excellent
• Other LCG/EGEE structures (GOC, ROCs, GGUS etc.) were effectively invisible
  – no communication line between the experiments and the Grid Operations Centres
  – operational trouble information always came through the EIS group
  – sites scheduled major upgrades or downtimes during our productions
    • no concept of “service” among the service providers yet!
    • many sites consider themselves part of a test structure set up (and funded) by EGEE
    • but we consider the LCG Grid an operational service for us!
  – many sites do not have the concept of “permanent disk storage” in a Storage Element
    • if they change something in their file system, our catalogue has to be updated!
Second ProdSys development cycle
• The experience with DC2 and the Rome production taught us that we had to re-think at least some of the ProdSys components
• The ProdSys review defined the way forward:
  – Frederic Brochu was one of the reviewers
  – Keep the global ProdSys architecture (system decomposition)
  – Replace or re-work all individual components to address the identified shortcomings of the Grid middleware:
    • reliability and fault tolerance first of all (see the sketch after this list)
  – Re-design the Distributed Data Management system to avoid single points of failure and scaling problems
• Work is now under way
  – target is the end of summer for integration tests
  – ready for LCG Service Challenge 3 from October onwards
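To make the reliability and fault-tolerance requirement concrete, the re-worked executors need a defensive wrapper around every Grid call, roughly of the following shape (a minimal sketch with hypothetical function names, not actual ProdSys code):

```python
# Minimal retry-with-backoff wrapper, illustrating the fault-tolerance
# requirement on the re-worked ProdSys executors. submit_fn stands for
# a call into the Grid middleware; the names here are hypothetical.
import time


def submit_with_retries(submit_fn, job, max_attempts=5, base_delay=30.0):
    """Call submit_fn(job), retrying transient Grid failures with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return submit_fn(job)
        except Exception as exc:  # in practice, only transient middleware errors
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```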
Distributed Data Management
• Accessing distributed data on the Grid is not a simple task
• Several central DBs are needed to hold dataset information
• “Local” catalogues hold information on local data storage (a schematic of this catalogue layering is sketched below)
• The new DDM system (shown on the slide) is under test this summer
• It will be used for all ATLAS data from October on (LCG Service Challenge 3)
• Affects GridPP effort
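As a rough illustration of the two-level catalogue scheme described above (all class, method and dataset names are hypothetical, not the real ATLAS DDM interface), a file lookup might proceed as follows:

```python
# Sketch of the central-vs-local catalogue split described above.
# Names are hypothetical illustrations, not the real ATLAS DDM API.

class CentralDatasetCatalogue:
    """Central DB: maps dataset names to the sites holding replicas."""
    def __init__(self):
        self._replicas = {}                  # dataset name -> set of site names

    def register(self, dataset, site):
        self._replicas.setdefault(dataset, set()).add(site)

    def sites_for(self, dataset):
        return sorted(self._replicas.get(dataset, set()))


class LocalSiteCatalogue:
    """'Local' catalogue: maps logical file names to physical paths at one site."""
    def __init__(self, site):
        self.site = site
        self._files = {}                     # logical file name -> physical path

    def add(self, lfn, pfn):
        self._files[lfn] = pfn

    def resolve(self, lfn):
        return self._files.get(lfn)


def locate_file(dataset, lfn, central, local_catalogues):
    """Find one physical replica of a file belonging to a dataset."""
    for site in central.sites_for(dataset):
        pfn = local_catalogues[site].resolve(lfn)
        if pfn is not None:
            return site, pfn
    raise LookupError(f"no replica of {lfn} found for dataset {dataset}")


# Example usage (hypothetical names):
# central = CentralDatasetCatalogue()
# central.register("rome.004100.recon", "ral.uk")
# local = {"ral.uk": LocalSiteCatalogue("ral.uk")}
# local["ral.uk"].add("AOD._00001.pool.root", "/atlas/aod/AOD._00001.pool.root")
# print(locate_file("rome.004100.recon", "AOD._00001.pool.root", central, local))
```

The point of the split is the one made in the bullets: only dataset-level bookkeeping is held centrally, while file placement details stay in the site catalogues, which is also why local file-system changes force catalogue updates.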
Computing Operations
• The Computing Operations organization is likely to change:
  a) Grid Tools
  b) Grid operations:
     – Tier-0 operations
     – Re-processing of real and simulated data at the Tier-1s
     – Data distribution and placement
     – Software distribution and installation
     – Site and software installation validation and monitoring
     – Coordination of Service Challenges in 2005-2006
     – User Support
       • Proposal to use Frederic Brochu in front-line triage
       • Credited contribution
       • Contingent on Distributed Analysis planning
Software Installation
• Software installation continues to be a challenge
  – Rapid roll-out of releases to the Grid is important for the ATLAS UK eScience goals (3.1.4)
  – Vital for user code in distributed analysis
• Grigori Rybkine (50/50 GridPP/ATLAS eScience):
  – Working towards 3.1.5: kit installation and package management in distributed analysis
  – Package manager implementation supports both tarballs and locally-built code (see the sketch below)
  – Essential support role
  – 3.1.5 is progressing well; 3.1.4 may have some delays because of the external effort on nightly deployable packages
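For orientation, the two installation paths mentioned above (release tarball vs. locally-built user code) amount to logic along the following lines; this is a minimal sketch with hypothetical paths and names, not the actual ATLAS kit-installation tools:

```python
# Sketch of a package manager that accepts either a pre-built release
# tarball or a locally-built source tree. Paths and names are
# hypothetical illustrations.
import shutil
import tarfile
from pathlib import Path


def install_release_tarball(tarball: Path, install_area: Path) -> Path:
    """Unpack a pre-built release kit, e.g. atlas-release-10.0.4.tar.gz."""
    target = install_area / tarball.stem.replace(".tar", "")
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:gz") as tar:
        tar.extractall(target)
    return target


def install_local_build(build_dir: Path, install_area: Path, name: str) -> Path:
    """Copy a locally-built package (e.g. user analysis code) into the install area."""
    target = install_area / name
    if target.exists():
        shutil.rmtree(target)
    shutil.copytree(build_dir, target)
    return target
```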
Current plans for EGEE/gLite
• Ready to test new components as soon as they are released from the internal certification process
  – we assume the LCG Baseline Services
• So far we have only seen the File Transfer Service and the LCG File Catalogue
  – both are being actively tested by our DDM group
  – FTS will be field-tested by Service Challenge 3, starting in July
  – LFC is in our plan for the new DDM (summer deployment)
• We have not really seen the new Workload Management System nor the new Computing Element
  – some informal ATLAS access to pre-release versions
• As soon as the performance is acceptable we will ask to have them deployed
  – this is NOT a blank check!
Distributed Analysis System
• ATLAS and GANGA work is now focused on Distributed Analysis
• LCG RTAG 11 in 2003 did not produce a common analysis system project as hoped. ATLAS therefore planned to combine the strengths of various existing prototypes:
  – GANGA provides a Grid front-end for Gaudi/Athena jobs (a schematic GANGA session follows below)
  – DIAL provides fast, quasi-interactive access to large local clusters
  – The ATLAS Production System interfaces to the 3 Grid flavours
• Alvin Tan
  – Work on the job-building GUI and Job Options Editor was well received
    • Wish from LBL to merge JOE with the Job Options Tracer project
  – Monitoring work was also well received; prototypes perform well
• Frederic Brochu
  – Provided a beta version of new job submission from GANGA directly to the Production System
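For orientation, a GANGA analysis job is driven from a short Python session roughly like the one below. This is schematic only: Job, Athena and LCG are GANGA plugin classes, but the specific attributes shown are illustrative and not tied to the 2005 GANGA release.

```python
# Schematic GANGA session, to be typed inside the ganga interpreter.
# Attribute names are illustrative of the style of use, not an exact
# match to the 2005 GANGA interface.
j = Job()
j.name = 'athena-analysis-test'
j.application = Athena()                  # Gaudi/Athena application plugin
j.application.option_file = 'MyAnalysis_jobOptions.py'
j.backend = LCG()                         # send the job to the LCG/EGEE Grid
j.submit()
```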
Distributed Analysis System (2)
• Currently reviewing this activity to define a baseline for the development of the start-up Distributed Analysis System
  – All of this has to work together with the DDM system described earlier
  – Decide a baseline “now”, so that we can have a testable system by this autumn
  – The outcome of the review may change GridPP plans
Conclusions
• ATLAS is (finally) getting effective throughput from LCG
• The UK effort is making an important contribution
• Distributed Analysis continues to pose a big challenge
  – ATLAS is taking the right management approach
  – GridPP effort will have to be responsive