User Board Overview
Dan Tovey
University Of Sheffield
Tier-1 Planning
• Quarterly UB meeting in April (see minutes) updated
Tier-1 planning figures
• Shortfall of Tier-1 resources in future years (especially
2008) evident.
• Will need to consider whether expt. requirements can be met
by Tier-2 resources → experiments need to demonstrate a clear need
for Tier-1 functionality.
• Requests which can be met by Tier-2 to be discussed
with Tier-2 board.
• ‘Other Experiments’ line removed from Tier-1
schedule following detailed Tier-1 Board plan → all
users must make representation to the UB to get access to
resources.
Tier-1 Planning
• Tier-1 utilisation figures frequently fall
significantly short of both requests and allocations
– sends the wrong message
– Often not the fault of experiments (e.g. middleware/
operational problems), but experiments must work to
produce more realistic estimates
• Move to strict allocation of disk resources (no
over-allocation) → helps the Tier-1 team.
• Also synchronise with the spending cycle → aim to
ensure complete use of all new resources as soon
as they come online.
DB Links
• Stronger links with the Deployment Board are
seen as vital → standing invitation for DB
representation at UB meetings.
UB Concerns
• How should experiments that are not moving to the
Grid globally be handled?
• Site stability & user support
• Balance of effort at Tier-1: much of it devoted to CMS
(SRM) and later the LCG Service Challenges, but what about
smaller user communities?
• What about ‘non-standard’ OSs at Tier-2 sites? These
can render a site useless to some experiments. The UB
and the Tier-2 Board need to persuade sites to work
towards standardisation.
Questionnaire
• User Board questionnaire updated for the latest OsC
process.
• No big changes from February
• Some new comments/concerns:
– Fragmented support structure
– All stick and no carrot
– Held up by problems with establishing the VO
– Not all experiments supported by large Tier-2s
• Further details at:
– http://www.gridpp.ac.uk/eb/workdoc/gridusebyexpts_0605.doc
Pleasure: LHCb
Shared data (LHCb RTTC production May/June)

Country                          Events produced
UK                               60 M
Italy                            42 M
Switzerland                      23 M
France                           11 M
Netherlands                      10 M
Spain                            8 M
Russia                           3 M
Greece                           2.5 M
Canada                           2 M
Germany                          0.3 M
Belgium                          0.2 M
Sweden                           0.2 M
Romania, Hungary, Brazil, USA    0.8 M

The data reported are preliminary (accurate to about 5%).
5% produced with plain DIRAC sites; 95% produced with LCG sites.
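As a quick, hedged cross-check of the per-country table, the figures can be summed to obtain the overall production and the UK share. This is a minimal sketch: the dictionary simply transcribes the table (in millions of events), the name `events_m` is illustrative, and the results carry the same ~5% uncertainty as the preliminary numbers.

```python
# Events produced per country in the LHCb RTTC production (May/June),
# in millions, transcribed from the table (preliminary, ~5% accuracy).
events_m = {
    "UK": 60.0,
    "Italy": 42.0,
    "Switzerland": 23.0,
    "France": 11.0,
    "Netherlands": 10.0,
    "Spain": 8.0,
    "Russia": 3.0,
    "Greece": 2.5,
    "Canada": 2.0,
    "Germany": 0.3,
    "Belgium": 0.2,
    "Sweden": 0.2,
    "Romania, Hungary, Brazil, USA": 0.8,
}

# Total production across all listed countries, and the UK fraction of it.
total = sum(events_m.values())
uk_share = events_m["UK"] / total

print(f"Total: {total:.1f} M events")  # Total: 163.0 M events
print(f"UK share: {uk_share:.1%}")     # UK share: 36.8%
```

On these figures the UK produced roughly 37% of all events, the largest single contribution.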
Pleasure: ATLAS
• Using the Grid for 100% of simulation,
digitisation and reconstruction.
• 8.5M fully simulated ATLAS events produced
• 20% of LCG jobs run in the UK
• Overall throughput good, and improving …
Pain: ATLAS
• But … experience has been painful!
• Significant throughput problems experienced in
January/February
– production goals descoped (15M events planned vs. 8.5M actual).
• Identified problems (highlights – see also questionnaire):
– System appears to function best when only one person submitting jobs!
– Lack of a distributed mechanism for prioritising jobs
– Lack of interoperability between LCG and other Grids: load balancing and
data replication have to be done 'by hand'. This leads to production errors (e.g.
the same sample produced multiple times on different Grids)
– Too much human intervention required to set, adjust and enforce priorities
– Could not easily saturate CPU resources on LCG (rate doubled with a simple
change of scripts/person!): production time does not scale with CPU
requirements
– Job definition/submission very (expert) labour intensive
– Absolute need for an SE/SRM solution for small files.
– Urgent need for VOMS, integrated with other grid tools for resource
allocation/access/monitoring/accounting
H1 Tests
30 jobs failed: 22 due to Grid problems (gridproxy/misc.)