Review of CERN Tier-2 Meeting

Review of WLCG Tier-2 Workshop
Duncan Rand
Royal Holloway, University of London
Brunel University
....from the perspective of a Tier-2 system manager

Workshop 3 days – lectures from experiments

Tutorial 2 days – parallel programme

Lots of talks with lots of detail!

General overview - refer to original slides for details

Oriented towards ATLAS (RHUL) and CMS (Brunel)
What did I expect?


An overview of the future

the big picture

more details about the experiments

data flows and rates

how were they going to use the Tier-2 sites?

what did they expect from us?
Perhaps, a tour of the LHC or an experiment
What do the experiments have in common?

Large volume of data to analyse (we knew that)
Need to distribute data to CPUs, keep track of it, analyse it
and upload results


However, also need to run lots of Monte Carlo (MC) jobs

common to all particle physics experiments

large fraction of all jobs run (ATLAS:1/3; CMS:1/2)

submitted from a central server – 'production'

explains mysterious 'prd' users e.g. lhcbprd running on our
Tier-2 now
What do they do in Monte Carlo production?
Start with small dataset (KB) with initial conditions describing
experiment


Model experiment from collision to analysis

Model proton-proton interactions, detector physics etc.

CPU intensive; about 10 kSI2k hours

Upload larger data-set to Tier-1 at the end
Relatively low network demands; steady data flow from Tier-2
to Tier-1 of about 50 Mbit/s (varies for each expt.)
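To put the steady upload rate in storage terms, here is a minimal sketch that converts a sustained rate into a daily volume. The function name is made up for illustration, and it assumes decimal units (1 GB = 10^9 bytes) and a link that really is busy around the clock.

```python
def daily_volume_gb(rate_mbit_s: float, hours: float = 24.0) -> float:
    """Data moved in `hours` at a sustained rate, in decimal gigabytes."""
    bits = rate_mbit_s * 1e6 * hours * 3600  # Mbit/s -> total bits
    return bits / 8 / 1e9                    # bits -> bytes -> GB


if __name__ == "__main__":
    # The ~50 Mbit/s Tier-2 -> Tier-1 figure quoted for MC production
    print(f"{daily_volume_gb(50):.0f} GB/day")
```

At 50 Mbit/s this works out to about 540 GB per day leaving the site, if the rate is sustained.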

Data Management
Data is immediately transferred from Tier-0 to Tier-1's for
backup

RAW data is first calibrated and reconstructed to give Event
Summary Data (ESD) and Analysis Object Data (AOD) suitable
for analysis

AOD data sets transferred to Tier-2's for analysis – ‘bursty’
depending on user needs, ~300 Mbit/s (varies for each expt.)


Tier-1’s will provide reliable storage of data

Tier-2’s act more like dynamic cache
Tier-1’s handle a larger or smaller share of essential services such as file
catalogues, FTS etc., depending on the experiment

Computing

Experiments have developed complex software tools to:





handle all this data transfer and keep track of datasets
(CMS: PhEDEx, ATLAS: DDM)
handle submission of MC production
(CMS: ProdManager/ProdAgent)
direct jobs to where the datasets are
enable a physicist in the office to carry out ‘chaotic user
analysis’ (the term describes the lack of central job
submission rather than their mode of work) (CMS: CRAB)
These tools place greater or lesser demands on a site, depending on the experiment
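The idea of directing jobs to where the datasets are can be sketched as below. This is an illustration only, not how PhEDEx, DDM or CRAB actually work: the replica catalogue, the free-slot table, the site names and the `pick_site` function are all hypothetical.

```python
from typing import Dict, Optional, Set


def pick_site(
    dataset: str,
    replicas: Dict[str, Set[str]],   # hypothetical catalogue: dataset -> sites holding a copy
    free_slots: Dict[str, int],      # hypothetical view of free batch slots per site
) -> Optional[str]:
    """Return a site that holds the dataset and has free slots, else None."""
    candidates = [
        site for site in replicas.get(dataset, set())
        if free_slots.get(site, 0) > 0
    ]
    if not candidates:
        return None
    # Prefer the site with the most free slots; ties resolved alphabetically.
    return max(sorted(candidates), key=lambda site: free_slots[site])


if __name__ == "__main__":
    replicas = {"aod_sample": {"SITE_A", "SITE_B"}}
    free_slots = {"SITE_A": 0, "SITE_B": 12, "SITE_C": 40}
    print(pick_site("aod_sample", replicas, free_slots))
```

Note that SITE_C is never chosen despite having the most free slots, because it holds no copy of the data: the job goes to the data rather than the data to the job.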
ALICE
ALICE - not highly relevant to the UK, as only supported by
Birmingham at Tier-2 level


Distinction between Tier-1 and Tier-2 is by Quality of Service
Require extra VO box installed at a site; unlikely to use non-ALICE Tier-2's opportunistically?

Developing ‘Parallel ROOT Facility’ (PROOF) clusters at Tier-2’s
for faster interactive data analysis

LHCb
Not going to use Tier-2's for analysis of data – concentrate
analysis at Tier-1


Only going to run Monte Carlo jobs at Tier-2's

Simplifies data transfer requirements at Tier-2 level

So, easiest for a Tier-2 to support
Low networking demands: 40 Mbit/s aggregated over all
Tier-2’s

UKI-LT2-Brunel (100 Mbit/s) recently in top 10 providers for
LHCb Monte Carlo

ATLAS
Tier-2's provide 40% of total computing and storage
requirements


Hierarchical structure between Tier-1's and Tier-2's

a Tier-1 provides services (FTS, LFC) to group of Tier-2's

no extra services required at Tier-2 level
Tier-2's will carry out MC simulations - results sent back to
Tier-1's for storage and further distribution and processing –
steady 30 Mbit/s from site

AOD (analysis object data) will be distributed to Tier-2's for
analysis: 160 Mbit/s to site

SC4: how long does it take to analyse 150 TB of data, equivalent to
1 year of LHC running?
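For a feel of the scale of these numbers, the sketch below estimates how long it would take just to move a given volume over a link. It is a purely illustrative calculation: the slides do not say that the whole 150 TB goes to a single site, and the function name and the decimal units (1 TB = 10^12 bytes) are assumptions.

```python
def transfer_days(volume_tb: float, rate_mbit_s: float) -> float:
    """Days needed to move `volume_tb` terabytes at a sustained rate."""
    bits = volume_tb * 1e12 * 8              # TB -> bytes -> bits
    seconds = bits / (rate_mbit_s * 1e6)     # bits / (bits per second)
    return seconds / 86400                   # seconds -> days


if __name__ == "__main__":
    # Hypothetical case: all 150 TB arriving over one 160 Mbit/s link
    print(f"{transfer_days(150, 160):.1f} days")
```

Under that assumption the transfer alone would take roughly 87 days, before any analysis job runs.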

CMS


CPU intensive processing mostly carried out at Tier-2’s
Tier-2’s run 50% MC and 50% analysis jobs
MC production jobs handled by central queue called
‘ProductionManager’



submit, track jobs and register output in CMS databases

jobs handed to ProductionAgents for processing
MC job output does not go from WN to Tier-1 directly


data is stored locally and small files are merged together
by new jobs (heavy I/O)
large file (~TB) returned to Tier-1
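The merge step can be pictured as grouping many small output files into batches of a target size, each batch becoming one merge job. The sketch below is a simple greedy grouping for illustration only. It is not ProdAgent's actual logic, and the function name, file names and sizes are invented.

```python
from typing import List, Tuple


def plan_merges(files: List[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Group (name, size) pairs, in order, into batches of at most target_bytes.

    A single file larger than the target still gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Start a new batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    outputs = [("job1.root", 4), ("job2.root", 4), ("job3.root", 4), ("job4.root", 10)]
    print(plan_merges(outputs, target_bytes=10))
```

Every merge job has to read all of its input files back from local storage, which is why this stage is heavy on I/O.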
CMS
Importance of good LAN bandwidth from WN’s to SE to do
this merging of files

Use ‘CRAB’ (CMS Remote Analysis Builder) at a UI to
analyse data


User specifies dataset

CRAB ‘discovers’ data, prepares job and submits it
‘Surviving the first years’: until the detector is understood, AODs
will not be that useful, so analysis will rely heavily on raw data,
with large networking demands

CMS: requirements of Tier-2 site


Division of labour

CMS look after global issues

Tier-2 look after local issues to keep site running
What is required:

a good batch farm with reliable storage

good LAN and WAN networking

install PhEDEx, LFC and Squid cache (calibration data)

pass Site Functional Tests

a good Tier-2 is ‘active, responsive, attentive, proactive’
Support and operations afternoon

Discovered that
WLCG = EGEE + OSG
i.e. we are now working more closely with the US Open
Science Grid

OSG not too relevant for the average Tier-2 sys-admin in UK
UKI ROC meeting

Small room, face to face meeting – lots of discussion
Grumbles about GGUS tickets and the time taken to close solved
tickets

Close it yourself: add ‘status=solved’ to the first line of the reply
Highlighted for me the somewhat one-directional flow of
information in the workshop itself

Would have been good for Tier-2’s to have been able to
present at the workshop

Middleware tutorials

Popular – lots of discussion
Understandable, given that Tier-2 system admins are more
interested in middleware than in experimental computing models

Good to be able to hear roadmap for LFC, DPM, FTS, SFT’s
etc. from middleware developers and ask questions

Tier-2 interaction

Didn't appear to be much interaction between Tier-2's
Lack of name badges?
Missed chance to find out how others do things

Michel Jouvin from GRIF (Paris) gave a summary of his
survey on Tier-2’s


large variation between resources at Tier-2’s

1 to 8 sites per Tier-2; 1 to 13 FTE!

Difference between distributed vs. federated Tier-2’s?

Post-workshop survey excellent idea
Providing a Service

We are the users and customers of the middleware

Tier-2 providing a service for experiments
CMS: ‘Your customers will be remote users’

Tier-2's need to generate a customer service mentality

Need good communication paths to ensure this works well
CMS have VRVS integration meetings and email list – sounds
promising

Not very clear how other experiments will communicate proactively

Summary

Learnt a lot about how the experiments intend to use Tier-2's

Pretty clear about what they need from Tier-2 sites

Could have been more feedback from Tier-2’s

Could have been more interaction between Tier-2’s

Tier-2’s are critical to success of LHC: service mentality

Communication between experiments and Tier-2’s unclear
The LHC juggernaut is changing up a gear!