

ATLAS Data Challenges on EDG

Oxana Smirnova (LCG/ATLAS/Lund), [email protected]

4th NorduGrid Workshop, Uppsala, November 11, 2002

EU DataGrid project

- Started on January 1, 2001, to deliver by the end of 2003
- Aim: to develop Grid middleware suitable for High Energy Physics, Earth Observation and biology applications
- Initial development is based on existing tools, e.g. Globus, LCFG, GDMP etc.
- The core testbed consists of the central site at CERN and a few facilities across Western Europe; many more sites are foreseen to join soon
  - Italy and the UK come with several sites each; Spain, Germany and others via CrossGrid
  - ATLAS-affiliated sites: Canada, Taiwan etc.
- By now the testbed has reached a stability level sufficient to test submission of production-style tasks


EDG Testbed

- EDG is committed to creating a stable testbed to be used by applications for real tasks
  - This started to materialize in mid-August...
  - ...and coincided with the ATLAS DC1; ATLAS asked for and was given first priority
- Most sites are installed from scratch using the EDG tools (RedHat 6.2 based)
  - NIKHEF: EDG installation and configuration only
  - Lyon: installation on top of an existing farm
- A lightweight EDG installation is available
- The central element is the Resource Broker (RB), which distributes jobs between the resources (see the sketch after this list)
  - Currently, only one RB (at CERN) is available for applications
  - In future, there may be an RB per Virtual Organization (VO)
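As an illustration of how a user interacts with the RB, the sketch below shows a pre-submission matchmaking query. The command name assumes the EDG 1.x user-interface tools of the period, and myjob.jdl is a hypothetical job description, not an actual Task Force file.

# Sketch: ask the (currently single, CERN-hosted) Resource Broker which
# Computing Elements match a job description before submitting it.
# Assumes the EDG 1.x UI command names; myjob.jdl is a placeholder.
dg-job-list-match myjob.jdl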


Chart borrowed from Guido Negri’s slides

EDG functionality as of today

[Diagram: job and data flow on the EDG testbed, involving the User Interface (UI), the Resource Broker (RB), Computing Elements (CE), the Replica Catalog (RC, LDAP), CASTOR accessed via rfcp, NFS, GDMP or the Replica Manager (RM) for replication, and JDL/RSL job descriptions; hosts shown include testbed010.cern.ch, lxshare0393.cern.ch, lxshare033.cern.ch and lxshare0399.cern.ch.]


ATLAS-EDG Task Force

- ATLAS is eager to use Grid tools for the Data Challenges
  - ATLAS Data Challenges are already on the Grid (NorduGrid, VDT)
  - The DC1/phase2 (starting now) is expected to be done using Grid tools to a larger extent
- The ATLAS-EDG Task Force was put together in August with the aims:
  - To assess the usability of the EDG testbed for the immediate production tasks
  - To introduce Grid awareness to the ATLAS collaboration
- The Task Force has representatives from both ATLAS and EDG: 40+ members (!) on the mailing list, ca. 10 of them working nearly full-time
- The initial task: to process 5 input partitions of Dataset 2000 at the EDG Testbed plus one non-EDG site (Karlsruhe); if this works, continue with other datasets


Execution of jobs

- It was expected that we could make full use of the Resource Broker functionality:
  - Data-driven job steering
  - Best available resources otherwise
- Input files are pre-staged once (copied from CASTOR and replicated elsewhere)
- A job consists of the standard DC1 shell-script, very much the way it is done on a conventional cluster
- A Job Definition Language (JDL) is used to wrap up the job (see the sketch after this list), specifying:
  - The executable file (script)
  - Input data
  - Files to be retrieved manually by the user
  - Optionally, other attributes (maxCPU, Rank etc.)
- Storage and registration of output files is a part of the job script, i.e. the application manages output data the way it needs
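The slides describe the JDL wrapper only in words; below is a minimal sketch of what such a wrapper and its submission might look like, assuming the EDG 1.x UI commands (dg-job-submit, dg-job-get-output) and the JDL attribute names of that release. The script, dataset, catalog and runtime-environment names are hypothetical placeholders, not the Task Force's actual production values.

#!/bin/sh
# Sketch only: wrap one DC1 partition in JDL and submit it through the
# EDG user interface. All file, dataset and host names are placeholders.

cat > dc1-0001.jdl <<'EOF'
Executable     = "dc1.simulation.sh";
Arguments      = "0001";
StdOutput      = "dc1.0001.out";
StdError       = "dc1.0001.err";
InputSandbox   = {"dc1.simulation.sh"};
OutputSandbox  = {"dc1.0001.out", "dc1.0001.err"};
InputData      = {"LF:dc1.002000.evgen.0001.zebra"};
ReplicaCatalog = "ldap://<rc-host>/<rc-base-dn>";
Requirements   = Member("ATLAS-3.2.1", other.RunTimeEnvironment);
EOF

# The InputData attribute lets the Resource Broker steer the job to a site
# holding a replica of the input partition (data-driven brokering).
dg-job-submit dc1-0001.jdl

# The OutputSandbox (log files) is retrieved manually later; large output
# files are stored and registered by the job script itself, not via JDL:
# dg-job-get-output <job-id>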


Hurdles

- EDG cannot replicate files directly from CASTOR and cannot register them in the Replica Catalog
  - Replication was done via the CERN SE; EDG is working on a better (though temporary) solution. The CASTOR team is writing a GridFTP interface, which will help a lot.
- Big file transfers interrupt after 21 minutes
  - A known Globus GridFTP server problem, temporarily fixed by using multi-threaded GridFTP instead of the EDG tools (see the sketch after this list)
- Jobs were "lost" by the system after 20 minutes of execution
  - A known problem of the Globus software (GASS Cache mechanism), temporarily fixed at the expense of frequent job submission
- Static information system: if a site goes down, it has to be removed manually from the index
  - Attempts are under way to switch to the dynamic hierarchical MDS; not yet stable due to Globus bugs
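As an illustration of the "multi-threaded GridFTP instead of the EDG tools" workaround mentioned above, a manual copy with parallel streams might look roughly like the following; the -p (parallel data streams) option of globus-url-copy is assumed to be available in the Globus 2.x installation of the testbed, and all hosts, paths and file names are placeholders.

# Sketch: copy a large input partition between Storage Elements with
# parallel GridFTP streams, bypassing the EDG replication tools.
# Hosts, paths and the file name are placeholders.
globus-url-copy -p 4 \
  gsiftp://<source-se>/<path>/dc1.002000.evgen.0001.zebra \
  gsiftp://<target-se>/<path>/dc1.002000.evgen.0001.zebra

Presumably the new replica then still has to be registered in the Replica Catalog by hand, which is part of why this counts only as a temporary fix.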


Other minor problems

- Installation of ATLAS software:
  - Cyclic dependencies
  - External dependencies, especially on system software
- Authentication and authorization, users and services:
  - EDG cannot instantly accept a dozen new national Certificate Authorities
  - The default proxy lives only 12 hours; users keep forgetting to request longer ones to accommodate long jobs (see the sketch after this list)
- Documentation:
  - Is abundant but not very user-oriented
  - Things are improving as more users arrive
- Information system:
  - Faulty information providers, affecting the brokering
  - Very difficult to browse/search and retrieve relevant information
- Data management:
  - Information about existing file collections is not easy to find
  - Management of output data is mostly manual (cannot be done via JDL)
- General instability of most EDG services
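For the proxy-lifetime issue above, requesting a longer-lived proxy on the UI might look like the sketch below, assuming the Globus 2.x grid-proxy-init and grid-proxy-info commands; the 48-hour lifetime is an arbitrary illustration, not a Task Force recommendation.

# Sketch: request a proxy valid for 48 hours instead of the default 12,
# so it outlives a long DC1 job, then check the remaining lifetime.
# (Lifetime chosen purely for illustration.)
grid-proxy-init -hours 48
grid-proxy-info -timeleft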


Achievements:

- A team of hard-working people across Europe (the ATLAS VO is 45 members strong as of today)
- ATLAS software (starting from release 3.2.1) is packed into relocatable RPMs, distributed and validated elsewhere (see the sketch after this list)
- The DC1 production script is "gridified", and a submission script is produced
- A user-friendly testbed status monitor and an ATLAS VO information page are deployed
- 5 Dataset 2000 input files are replicated to 5 sites (2 at each)
- Two production-style tests completed:
  - The first 100 partitions of Dataset 2000 are processed
  - Other (smaller) datasets: 4 input files (ca. 400 MB each) replicated to 4 sites; 250 jobs submitted, adjusted to run ca. 4 hours each. The jobs were distributed across the whole testbed by the Resource Broker
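The slides do not show the installation mechanics of the relocatable RPMs mentioned above; a rough sketch of how a site might install one under its own prefix is given below. The package file name, version string and prefix are hypothetical.

# Sketch: install a relocatable ATLAS software RPM under a site-chosen
# prefix (package name and paths are placeholders).
rpm -ivh --prefix /opt/atlas atlas-release-3.2.1-1.i386.rpm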


Summary

Testbed 1.2.0
  Job execution: GASS Cache problems, 100% failure
  Data management: big file replication fails (GridFTP timeout); no CASTOR support

Testbed 1.2(.1) (only the CERN site is available, GASS Cache "unfixed")
  Job execution: half of the Dataset 2000 jobs are executed, 100% success
  Data management: not applicable (only one site is used)

Testbed 1.2.2 (all the core sites have GASS Cache "unfixed")
  Job execution: 400 short jobs are executed across the testbed; the rest of the Dataset 2000 jobs proceeded with a > 50% resubmission rate
  Data management: short files are replicated everywhere; longer files are copied manually (GridFTP not fixed)

Testbed 1.3 (a.k.a. "The Showstopper" release)
  Job execution: to be tested (GASS Cache is expected to be fixed)
  Data management: to be tested (GridFTP is expected to be fixed)


What next

- Testbed 1.3 is available for testing (not on the production site yet) from today
- Precise quantification of the failure/success rate using Dataset 2000 partitions is to be done on Testbed 1.3
- ATLAS DC1, pile-up: the runtime environment is ready, scripts are prepared
  - Testbed feature: the "old" runtime environment (3.2.1) has to be replaced with a new one (4.0.1)
- The CASTOR-EDG interface has to be tested; a GridFTP server on CASTOR is expected to arrive soon
- Some ATLAS production sites may join the EDG Testbed soon


Emanuele Leonardi, Cal Loomis, Fairouz Malek-Ohlsson, Gonzalo Merino, Armin Nairz, Guido Negri, Steve O'Neale, Laura Perini, Gilbert Poulard, Alois Putzer, Di Qing, Mario Reale, David Rebatto, Zhongliang Ren, Silvia Resconi, Alessandro De Salvo, Markus Schulz, Massimo Sgaravatto, Oxana Smirnova, Chun Lik Tan, Jeff Templon, Stan Thompson, Luca Vaccarossa, Peter Watkins