GridPP3
David Britton, 6 September 2006
Overview
The GridPP3 proposal consists of a 7-month extension to GridPP2, followed by a three-year GridPP3 project starting in April 2008.

GridPP2+ (7-month extension from September 2007 to March 2008)
- Early approval sought in order to ensure staff retention.
- Provides continuity of management and support over the LHC start-up.
- Aligns the project with (a) the financial year; (b) EGEE and other EU projects.

GridPP3 (3-year project from April 2008 to March 2011)
- "From production to exploitation."
- Delivers large-scale computing resources in a supported environment.
- Underpins the success of the UK contribution to the LHC.
Global Context

[Timeline figure, 2001 onwards: EDG, EGEE-I, EGEE-II, and possibly a future EGI; GridPP1, GridPP2, and GridPP3; LCG; leading into LHC data taking. The projects evolve against developing requirements, evolving standards, changing costs and budgets, and accumulated experience.]
WLCG MoU

• 17 March 2006: PPARC signed the Memorandum of Understanding with CERN.
• Commitment to the UK Tier-1 at RAL and the four UK Tier-2s to provide services and resources.
• Current MoU signatories: China, France, Germany, India, Italy, Japan, Netherlands, Pakistan, Portugal, Romania, Taiwan, UK, USA.
• Pending signatures: Australia, Belgium, Canada, Czech Republic, Nordic countries, Poland, Russia, Spain, Switzerland, Ukraine.
Grid Overview

Aim: by 2008 (the first full year's data taking)
- CPU ~100 MSI2k (100,000 CPUs)
- Storage ~80 PB
- Involving >100 institutes worldwide
- Build on complex middleware in Europe (gLite) and in the USA (VDT)

1. Prototype went live in September 2003 in 12 countries
2. Extensively tested by the LHC experiments in September 2004
3. 197 sites, 13,797 CPUs, 5 PB storage in September 2005
4. 177 active sites, 26,527 CPUs, 10 PB storage in September 2006
Tier-0 to Tier-1

• Worldwide data transfers of >950 MB/s sustained for 1 week
• Peak transfer rate from CERN of >1.6 GB/s
• Ongoing experiment transfers as part of the current service challenges
Tier-1 to Tier-2

• UK data transfers of >1000 Mb/s sustained for 3 days
• Peak transfer rate from RAL of >1.5 Gb/s
• High data-rate transfers (300-500 Mb/s) to/from RAL are required as a routine activity
It's in use:

Active users by LHC experiment: ALICE (8), ATLAS (70), CMS (150), LHCb (40).
Tier Centres

Tier-0 (all experiments): first-pass scheduled reconstruction.

ALICE - Tier-1: reconstruction; on-demand analysis. Tier-2: central simulation; on-demand analysis.
ATLAS - Tier-1: reconstruction; scheduled analysis/skimming; calibration. Tier-2: simulation; on-demand analysis; calibration.
CMS   - Tier-1: reconstruction; scheduled analysis/skimming. Tier-2: simulation; on-demand analysis; calibration.
LHCb  - Tier-1: reconstruction; on-demand analysis; scheduled skimming. Tier-2: simulation.
LHC Hardware Requirements

For each resource: the total requirement, the fraction pledged by non-UK Tier centres, and the proposed GridPP3 contribution (with its share of the requirement in brackets).

Tier-1 CPU [MSI2k]
  ALICE: required 12.3; non-UK pledged 54%; GridPP3 0.16 (1%)
  ATLAS: required 24.0; non-UK pledged 89%; GridPP3 3.00 (13%)
  CMS:   required 15.2; non-UK pledged 73%; GridPP3 1.56 (10%)
  LHCb:  required 4.4;  non-UK pledged 85%; GridPP3 0.74 (17%)

Tier-1 Disk [PB]
  ALICE: required 7.4;  non-UK pledged 36%; GridPP3 0.11 (1%)
  ATLAS: required 14.4; non-UK pledged 81%; GridPP3 1.78 (12%)
  CMS:   required 7.0;  non-UK pledged 75%; GridPP3 0.84 (12%)
  LHCb:  required 2.4;  non-UK pledged 77%; GridPP3 0.41 (17%)

Tier-1 Tape [PB]
  ALICE: required 6.9;  non-UK pledged 45%; GridPP3 0.10 (1%)
  ATLAS: required 9.0;  non-UK pledged 90%; GridPP3 1.12 (12%)
  CMS:   required 16.7; non-UK pledged 53%; GridPP3 1.44 (9%)
  LHCb:  required 2.1;  non-UK pledged 74%; GridPP3 0.35 (17%)

Tier-2 CPU [MSI2k]
  ALICE: required 14.4; non-UK pledged 41%; GridPP3 0.18 (1%)
  ATLAS: required 19.9; non-UK pledged 83%; GridPP3 2.66 (13%)
  CMS:   required 19.3; non-UK pledged 90%; GridPP3 1.80 (9%)
  LHCb:  required 7.7;  non-UK pledged 38%; GridPP3 2.17 (28%)

Tier-2 Disk [PB]
  ALICE: required 3.5;  non-UK pledged 39%; GridPP3 0.05 (1%)
  ATLAS: required 8.7;  non-UK pledged 63%; GridPP3 1.14 (13%)
  CMS:   required 4.9;  non-UK pledged 92%; GridPP3 0.40 (8%)
  LHCb:  n/a

Basis of the GridPP3 contributions:
- ALICE: based on the UK M&O author fraction (1.2%).
- ATLAS: based on the UK fraction of Tier-1 authors.
- CMS: based on a threshold size for a minimum viable Tier-1.
- LHCb: based on the authorship fraction (16.5%) and the number of Tier-1s.

The overall resource level is reviewed by the LHCC. The balance of CPU, storage, and network is driven by the computing models.
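The bracketed percentages can be reproduced from the other two columns; a minimal check, using the LHCb Tier-1 CPU row as an example:

```python
# Reproduce the bracketed GridPP3 share from the requirements table:
# share = GridPP3 contribution / total requirement (LHCb Tier-1 CPU row).
required_msi2k = 4.4   # total LHCb Tier-1 CPU requirement [MSI2k]
gridpp3_msi2k = 0.74   # proposed GridPP3 contribution [MSI2k]

share = gridpp3_msi2k / required_msi2k
print(f"GridPP3 supplies {share:.0%} of the LHCb Tier-1 CPU requirement")
```

The same ratio reproduces the other rows to within rounding (e.g. 1.78/14.4 gives the quoted 12% for ATLAS Tier-1 disk).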
Non-LHC Hardware Requirements

BaBar: Included explicitly, based on a well-understood resource requirement per fb⁻¹ and the expected luminosity profile up to October 2008. The level is ~15% of Tier-1 CPU and tape, and 9% of disk, in 2008.

UKQCD: Request received after the planning stage was completed, so not included in the model. (Some uncertainty remains over whether UKQCD will move to an LCG-based Grid, and how the manpower would be funded.) The level is 3-4% of Tier-2 resources and ~7% of Tier-1 tape in 2008.

Others: The requirements of other, smaller user groups, and some provision for future larger groups (LC, Neutrino) whose requirements are currently largely unknown, have been addressed with a 5% envelope allocation of Tier-2 disk and CPU, and Tier-1 tape.
Budget Overview

Share of project cost: Tier-1 50%; Tier-2 25%; Support 13%; Operations 6%; Management 3%; Travel+Other 2%; Outreach 1%.

Cost table [£m]:
  Tier-1 Staff              4.99
  Tier-1 Hardware          11.72
  Tier-2 Staff              3.29
  Tier-2 Hardware           5.12
  Grid Support Staff        4.50
  Grid Operations Staff     1.89
  Management                1.17
  Outreach                  0.37
  Travel and Other          0.84
  Project Sub-Total        33.90
  Working Allowance (4%)    1.25
  Project Cost             35.15
  Contingency (12%)         4.15
  Tier-1 Running Costs      2.50
  Full Approval Cost       41.80
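The roll-up in the cost table can be checked with a few lines of arithmetic; a sketch using the quoted figures (the results agree with the table to within the rounding of the individual line items):

```python
# Line items from the GridPP3 cost table, in £m.
items = [4.99, 11.72, 3.29, 5.12,       # Tier-1/Tier-2 staff and hardware
         4.50, 1.89, 1.17, 0.37, 0.84]  # support, operations, management,
                                        # outreach, travel and other
subtotal = sum(items)                   # quoted as 33.90
project_cost = subtotal + 1.25          # + working allowance (~4%)
full_cost = project_cost + 4.15 + 2.50  # + contingency (~12%) + Tier-1 running
print(f"{subtotal:.2f} {project_cost:.2f} {full_cost:.2f}")
```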
Tier-1 Centre

Defined by the experiment hardware requirements, the experiment computing models, a hardware costing model, and by the service levels defined in the international MoU signed by PPARC.

[Bar charts: Tier-1 CPU, disk, and tape capacity by experiment (ALICE, ATLAS, BaBar, CMS, LHCb, Others), 2008-2012.]

Estimated Tier-1 peak data flows in 2008 [MB/s]:

        T0 to T1   T1 to T1        T2 to T1   T1 to T2   CPU to   Disk to   Disk to   Tape to
                   (incl. T1-T0)                          disk     CPU       tape      disk
ATLAS     940        2361             264         34       610      165        25       105
CMS       423        1590             242         45       240      240        58       360
LHCb      212         278              63         54       184      184         3         0
Total    1130        4229             505        133       752      589        86       465
Tier-1 Centre: Service Level

Maximum delay in responding to operational problems (service interruption; degradation of the capacity of the service by more than 50%; by more than 20%), and average availability measured on an annual basis (during accelerator operation; at all other times):

Service                                        Interruption  >50% degr.  >20% degr.  Avail. (accel.)  Avail. (other)
Acceptance of data from the Tier-0 Centre      12 hours      12 hours    24 hours    99%              n/a
Networking service to the Tier-0 Centre
  during accelerator operation                 12 hours      24 hours    48 hours    98%              n/a
Data-intensive analysis services, including
  networking to Tier-0, Tier-1 Centres         24 hours      48 hours    48 hours    98%              98%
All other services (prime service hours)       2 hours       2 hours     4 hours     98%              98%
All other services (other times)               24 hours      48 hours    48 hours    97%              97%
Tier-1 Centre: Staff

Work Area               PPARC funding   CCLRC funding
CPU                     2.0             0.0
Disk                    3.0             0.0
Tape and CASTOR         2.0             1.3
Core Services           1.0             0.5
Operations              3.0             1.0
Incident Response Unit  3.0             0.0
Networking              0.0             0.5
Deployment              1.5             0.0
Experiments             1.5             0.0
Tier-1 Management       1.0             0.3
Totals                  18.0            3.6

Core Services refers to user-file systems, monitoring, software deployment, and the conditions database. Operations refers to the machine-room environment, hardware diagnostics/repair, automation, fabric management, tape movement, etc. The Incident Response Unit addresses the MoU service requirement, including out-of-hours call-out.
Tier-2 Centres

GridPP has successfully developed four distributed Tier-2 Centres, which have:
- Engaged the institutes;
- Levered large amounts of resources;
- Developed local expertise;
- Stimulated cross-disciplinary relationships;
- Helped promote the Grid, GridPP, Particle Physics, and the local groups within the universities.

Successes: development of a regional management structure; an MoU signed by each institute with GridPP; deployment of complex middleware; accounting; security; data transfers; all fully operational and contributing to LCG.
Tier-2 Centres

To match the LHC computing models, around 50% of the UK computing resources will be located at the Tier-2s. Service levels are not as demanding as at the Tier-1.

Maximum delay in responding to operational problems, and average availability measured on an annual basis:

Service                     Prime time   Other periods   Availability
End-user analysis facility  2 hours      72 hours        95%
Other services              12 hours     72 hours        95%

The distributed nature of the UK Tier-2s has technical advantages ("divide and conquer") and technical drawbacks ("inefficiencies"). The importance of the political/social aspects should not be underestimated.

[Bar charts: Tier-2 total CPU and disk capacity by experiment (ALICE, ATLAS, BaBar, CMS, LHCb, Others, UKQCD), 2008-2012.]
Tier-2 Market Model

1) Assume all the Institutes involved are interested in building on their current contribution, so that...
2) Effectively a "market" exists to provide Tier-2 resources to HEP (because many Institutes have dual-funding opportunities and/or internal reasons to be involved).
3) GridPP offers a market price for Tier-2 resources, which institutes may or may not choose to accept.
4) The market price is adjusted to optimise the resources obtained.
5) The market price is bounded by what it would cost to provision the resources at the Tier-1.

Inefficiencies associated with the distributed nature of the Tier-2s may be balanced by an increase in competition/leverage.
Tier-2 Hardware Allocations

Constrained by the requirement for institutional JeS forms, GridPP made an initial mapping (or allocation, i.e. not quite the "market" approach intended) of Tier-2 hardware.

Fraction of each experiment allocated to each Tier-2:

            ATLAS   CMS    LHCb   Other
London      0.25    0.75   0.10   0.30
NorthGrid   0.50    0.00   0.20   0.40
ScotGrid    0.15    0.00   0.30   0.10
SouthGrid   0.10    0.25   0.40   0.20

Relative fraction of each experiment allocated to each institute within its Tier-2:

London       ATLAS   CMS    LHCb   Other
Brunel       0.00    0.10   0.00   0.15
Imperial     0.00    0.90   1.00   0.00
QMUL         0.70    0.00   0.00   0.60
RHUL         0.20    0.00   0.00   0.15
UCL          0.10    0.00   0.00   0.10

NorthGrid    ATLAS   CMS    LHCb   Other
Lancaster    0.50    0.00   0.00   0.50
Liverpool    0.20    0.00   1.00   0.15
Manchester   0.20    0.00   0.00   0.20
Sheffield    0.10    0.00   0.00   0.15

ScotGrid     ATLAS   CMS    LHCb   Other
Durham       0.10    0.00   0.10   0.25
Edinburgh    0.00    0.00   0.40   0.25
Glasgow      0.90    0.00   0.50   0.50

SouthGrid    ATLAS   CMS    LHCb   Other
Birmingham   0.40    0.00   0.00   0.30
Bristol      0.00    0.50   0.25   0.20
Cambridge    0.25    0.00   0.25   0.15
Oxford       0.25    0.00   0.25   0.25
RAL PPD      0.10    0.50   0.25   0.10

Allocations are based on past delivery, current size, and the size of the local community of physicists.
Tier-2 Staff Allocations

GridPP currently funds 9 FTE at 17 institutes. In GridPP3, this is proposed to increase to 14.75 FTE (c.f. the Tier-1 has 18 FTE funded by GridPP3 for a comparable amount of hardware). Again, in this market approach this is the "effort (currently) offered" and not an estimate of the "full effort needed".

London     FTE    NorthGrid    FTE    ScotGrid    FTE    SouthGrid    FTE
Brunel     0.50   Lancaster    1.50   Durham      0.25   Birmingham   1.00
Imperial   1.50   Liverpool    1.00   Edinburgh   0.50   Bristol      1.00
QMUL       1.00   Manchester   1.50   Glasgow     1.00   Cambridge    0.50
RHUL       0.50   Sheffield    1.00                      Oxford       0.50
UCL        1.00                                          RAL PPD      0.50
Total      4.50                5.00               1.75                3.50
Tier-2 Hardware Costs (Agreed by CB)

CPU (KSI2K)         2007      2008      2009      2010      2011      2012
Requirement                   7560      10215     14522     18203     21708
Amount paid for     1559      2106      2994      3753      4476
Unit Cost           £0.392k   £0.312k   £0.247k   £0.175k   £0.124k   £0.087k
Cost                £612k     £656k     £740k     £656k     £553k     £0k
Total (inc. Disk)   £1,163k   £1,295k   £1,383k   £1,282k   £1,120k   £0k

• Take the requirement in the following year (7560) divided by the lifetime in years (4.85 for CPU, 3.9 for disk) = 1559.
• Multiply by the unit cost in that year (£0.392k/KSI2K) = £612k.
• Similarly for disk.
• It is up to the institutes how they spend it (new kit, replacement kit, central services ...).
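The costing rule in the bullets is simple enough to sketch; the figures below are the 2007 CPU numbers from the table (the slide quotes £612k; the £1k difference is intermediate rounding):

```python
# Tier-2 CPU costing rule: pay each year for 1/lifetime of next year's
# requirement, at that year's unit cost.
requirement_2008 = 7560   # KSI2K required in 2008
cpu_lifetime = 4.85       # assumed CPU lifetime in years (3.9 for disk)
unit_cost_2007 = 0.392    # £k per KSI2K in 2007

amount_paid_for = requirement_2008 / cpu_lifetime   # ~1559 KSI2K
cost = amount_paid_for * unit_cost_2007             # ~£611k
print(round(amount_paid_for), round(cost))
```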
Tier-2 Resources

Sanity checks:

1) Compare the cost to GridPP of hardware at the Tier-1 and Tier-2, integrated over the lifetime of the project:

                      Tier-1   Tier-2
CPU  (K£/KSI2K-year)  0.070    0.045
Disk (K£/TB-year)     0.144    0.109
Tape (K£/TB-year)     0.052    -

2) Total cost to the project: compare the (staff + hardware) cost of the Tier-2 facilities with the cost to the project of placing the same hardware at the Tier-1 (assuming that doubling the Tier-1 hardware requires a 35% increase in staff). Including staff and hardware, the cost of the Tier-2 facilities is ~80% of the cost of an enlarged Tier-1.
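The hardware side of the first sanity check can be expressed as a ratio; a sketch using the unit costs in the table (hardware only; the ~80% figure in the second check additionally includes staff):

```python
# Lifetime-integrated unit costs from the sanity-check table
# (K£ per KSI2K-year for CPU, K£ per TB-year for disk).
tier1 = {"CPU": 0.070, "Disk": 0.144, "Tape": 0.052}
tier2 = {"CPU": 0.045, "Disk": 0.109}   # no tape provision at the Tier-2s

for resource, t2_cost in tier2.items():
    ratio = t2_cost / tier1[resource]
    print(f"{resource}: Tier-2 hardware costs {ratio:.0%} of Tier-1 per unit-year")
```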
Grid Support

Refers to staff effort for the support of the Middleware, Security and Networking areas in GridPP3. The emphasis is on a managed transition from middleware development to middleware support (operational and bug-fixing).

Three criteria were applied to guide the prioritisation of areas for support:
1) Areas which are "mission critical" for the UK.
2) Areas which are viewed as established "international obligations".
3) Areas which provide significant leverage to the obvious advantage of GridPP.

The background documents discuss the areas in terms of:
a) Operational support;
b) Maintenance (bug-fixing);
c) Development (phased out where practical).
Grid Support Areas

FTE per financial year (per-role figures given where the extraction preserved them):

Area (roles)                                                 FY08   FY09   FY10
Grid Data Management (metadata; replica management;
  operational service support)                               3.0    3.0    3.0
Storage (Castor, DPM and dCache support;
  operational service support)                               3.0    2.5    2.0
Information & Monitoring (R-GMA & support 3.0/1.5/0.6;
  service discovery support 0.5/0.3/0.2; GLUE &
  international collaboration 0.5/0.2/0.2; operational
  service support 1.0 each year)                             5.0    3.0    2.0
Workload, Performance & Portal (performance testing;
  real-time monitoring; portal support)                      2.5    2.0    2.0
Security (operational security officer 1.0; GridSite
  support 1.5; VOMS support 0.5; international security
  co-ordination 0.8, each year)                              3.8    3.8    3.8
Networking (requirements & provisioning 0.5;
  performance monitoring 0.5, each year)                     1.0    1.0    1.0
TOTALS                                                      18.3   15.3   13.8
Grid Support Staff
Evolution
GridPP2
GridPP2
GridPP
WLMS
Portal
Data
Management
Data Storage
Security
InfoMon
Network
HP Post
GRAND TOTAL
6/September/2006
Tier-2 Expert
MSN
Apps. Interface
Tier-2 Expert
MSN
Tier-2 Expert
MSN
Tier-2 Expert
MSN
Tier-2 Expert
MSN
Tier-2 Expert
MSN
Tier-2 Expert
EGEE
1.0
1.0
1.0
1.0
1.0
1.0
2.0
1.5
3.5
0.0
3.5
3.5
0.5
1.5
2.0
0.5
24.5
GridPP2+
GridPP
EGEE
1.0
1.0
1.0
1.0
1.0
1.0
2.0
1.5
3.5
0.0
3.5
3.5
0.5
1.5
2.0
0.0
24.0
GridPP3
GridPP3
FY08
FY09
GridPP3
FY10
2.5
2.0
2.0
Workload
Performance
and Portal
3.0
3.0
3.0
Data
Management
3.0
2.5
2.0
Storage
3.8
3.8
3.8
Security
5.0
3.0
2.0
InfoMon
1.0
1.0
1.0
Networking
0.0 0.0 0.0
18.3 15.3 13.8
D. Britton
Grid Operations

A team of 8.5 FTE consisting of:
- 1 Production Manager;
- 4 Tier-2 Coordinators;
- 3 to run the UK/GridPP Grid Operations Centre (GOC);
- 0.5 FTE to coordinate technical documentation.

Responsible for the deployment, operation, and support of the UK Particle Physics Grid environment. The Production Manager is responsible for resolving technical and coordination issues that span the Tier-1 and Tier-2s, and for ensuring a stable production service with appropriate upgrades to improve functionality and quality.

The current GOC (5.5 FTE funded by EGEE) is responsible for monitoring the worldwide Grid operations, and for providing trouble tickets, accounting services, and administrative tools.
Operations Posts

Area             Role                                                   FY08   FY09   FY10
Grid Deployment  Production Manager                                     1.0    1.0    1.0
                 Tier-2 Technical Coordinators (one for each
                 of the 4 regional centres)                             4.0    4.0    4.0
                 Technical Documentation                                0.5    0.5    0.5
Grid Operations  Monitoring of LCG operations in the UK                 1.0    1.0    1.0
                 Grid Accounting                                        1.0    1.0    1.0
                 International Coordination                             0.5    0.5    0.5
                 Security Risk Management                               0.5    0.5    0.5
TOTALS                                                                  8.5    8.5    8.5
GridPP3 Structure

[Organisation diagram: Collaboration Board (CB); Project Management Board (PMB); Deployment Board (DB, provision); User Board (UB, utilisation); Oversight Committee (OC, review).]
Management Continuity

FTE by post; GridPP2 post names on the left, GridPP3 post names on the right where they change:

Post (GridPP2)           GridPP2   GridPP2+   FY08   FY09   FY10   Post (GridPP3)
Project Leader           0.67      0.67       0.90   0.90   0.90   Project Leader
Project Manager          0.90      0.90       1.00   1.00   1.00   Project Manager
T2 Coordinator           0.50      0.50       -      -      -
DB Chair                 0.30      0.30       0.40   0.40   0.40   Deployment Coordinator
UB Chair                 0.00      0.00       0.25   0.25   0.25   UB Chair
Middleware Coordinator   0.50      0.50       -      -      -
Application Coordinator  0.50      0.50       0.40   0.40   0.40   Technical Coordinator
CCLRC Management         0.50      0.50       0.50   0.50   0.50   CCLRC Management
Sub-Total                3.87      3.87       3.45   3.45   3.45
Outreach

Currently a Dissemination Officer and an Events Officer (1.5 FTE). The instructions in the PPARC call include the statement:

"It is expected that a plan for collaboration with industry will be presented or justification if such a plan is not appropriate."

Therefore, the mandate is broadened to include industrial liaison without increasing manpower, but 0.5 FTE is added to this area from the current documentation officer to handle user documentation and web-site maintenance. Overall, a team of 2 FTE responsible for:
- Dissemination activities (news, press releases, liaison with partners, etc.);
- Event organisation (demos, publicity, etc.);
- Industrial liaison (to be developed);
- Basic user documentation and website maintenance.
GridPP3 Posts

FTE by area; GridPP2 and GridPP2+ are split into GridPP-funded and EGEE-funded posts:

Area           Posts                      GridPP2          GridPP2+         GridPP3
                                          GridPP   EGEE    GridPP   EGEE    FY08    FY09    FY10
Management     All management posts       3.87     -       3.87     -       3.45    3.45    3.45
Tier-1         All Tier-1 services        13.50    1.50    16.00    1.50    18.00   18.00   18.00
Tier-2         Hardware support           9.00     -       9.00     -       14.75   14.75   14.75
               Specialist posts           5.50     -       5.00     -       -       -       -
Middleware     All MSN posts              13.00    3.50    13.00    3.50    18.30   15.30   13.80
Applications   All application posts      18.50    -       1.00     -       -       -       -
Operations     Operations Manager         1.00     -       1.00     -       1.00    1.00    1.00
               Tier-2 Coordinators        0.00     4.00    0.00     4.00    4.00    4.00    4.00
               GOC posts                  0.00     5.50    0.00     5.50    3.00    3.00    3.00
               Technical documentation    -        -       -        -       0.50    0.50    0.50
Documentation  Documentation Officer      -        1.00    -        1.00    0.50    0.50    0.50
Dissemination  Dissemination + Events     -        1.50    -        1.50    1.50    1.50    1.50
TOTAL (including EGEE)                    81.37            65.87            65.00   62.00   60.50
Travel and Other Costs

Based on experience in GridPP2, we have budgeted £3.5k per FTE per annum for travel (a reduction of about 10%) to cover collaboration meetings, national and international conferences and workshops, technical meetings, management meetings, etc.

"Other Costs" of £15k per annum have been included for outreach expenses and other operational expenses (licences, laptops, test machines, web server, software, etc.).
Total Costs [k£]

Work Package              FY07      FY08      FY09      FY10      Total
A Tier-1 Staff            693.47    1384.21   1432.66   1482.80   4993
A Tier-1 Hardware         3810.50   2621.22   3025.76   2265.54   11723
B Tier-2 Staff            147.84    1008.59   1047.87   1088.90   3293
B Tier-2 Hardware         1163.24   1294.63   1382.60   1281.84   5122
C Support                 695.81    1416.04   1232.93   1155.33   4500
D Operations              43.34     592.72    614.84    637.85    1889
E Management              194.35    311.78    324.97    338.78    1170
F Outreach                59.97     99.72     103.58    107.61    371
G Travel and Other        134.48    242.50    232.00    226.75    836
Total                     6943.00   8971.41   9397.21   8585.39   33897
Risks

Risks are scored as Likelihood (1-4) x Impact (1, 2, 3, 5).

The GridPP2 risk register covers the following risks, each scored separately for the GridPP, LCG, MSN, Apps, and Production Grid areas:

R1 Recruitment/retention difficulties; R2 Sudden loss of key staff; R3 Minimal contingency; R4 GridPP deliverables late; R5 Sub-components not delivered to project; R6 Non take-up of project results; R7 Change in project scope; R8 Bad publicity; R9 External OS dependence; R10 External middleware dependence; R11 Lack of monitoring of staff; R12 Withdrawal of an experiment; R13 Lack of cooperation between Tier centres; R14 Scalability problems; R15 Software maintainability problems; R16 Technology shifts; R17 Repetition of research; R18 Lack of funding to meet LCG Phase-1 goals; R20 Conflicting software requirements; R22 Hardware resources inadequate; R25 Hardware procurement problems; R26 LAN bottlenecks; R27 Tier-2 organisation fails; R28 Experiment requirements not met; R29 SYSMAN effort inadequate; R30 Firewalls interfere with Grid; R31 Inability to establish trust relationships; R32 Security inadequate to operate Grid; R33 Interoperability; R35 Failure of international cooperation; R36 e-Science and GridPP divergence; R37 Institutes do not embrace Grid; R38 Grid does not work as required; R39 Delay of the LHC; R40 Lack of future funding; R41 Network backbone failure; R42 Network backbone bottleneck; R43 Network backbone upgrade delay; R44 Inadequate user support.

GridPP3-level risks, scored as (Likelihood, Impact, Risk), with the action to mitigate each:

1. Insufficient funding (3, 5, 15): Present requirements. PPRP to advise on strategic priorities.
2. Hardware costing: hardware prices don't fall as anticipated (2, 3, 6): Delay if possible, or de-scope if necessary.
3. Tier-2 market fails (2, 3, 6): Increase the Tier-2 hardware price and/or the Tier-2 staffing level.
4. Tier-1 fails to meet the required service level (3, 2, 6): Increase the Tier-1 staffing level.
5. Tier-2s fail to meet the required service level (3, 2, 6): Increase the Tier-2 staffing level.
6. Middleware fails (2, 3, 6): Mitigated by experiment-specific solutions. Work with partners to address shortcomings. Re-target support effort.
7. Industrial take-up low (3, 1, 3): Facilitated by the Industrial Liaison post.
8. Outreach fails (1, 2, 2): Appoint a Dissemination Officer.
9. Staffing missing/unqualified (1, 3, 3): Build on existing expertise. The likelihood is assumed low given early approval of the GridPP2 extension.
10. Organisational problems (1, 3, 3): Define/build/agree the GridPP3 structure. Clarify the role of GridPP3 and its interactions.
11. Technical risks (see GridPP2 risks R9, R10, R13, R14, R16, R22, R25, R27, R36; also physical risks) (*, *, *): Develop a full GridPP3 risk register based on that from GridPP2.
12. Inadequate support infrastructure (2, 2, 4): Monitor the performance of support activities via pre-defined metrics.
13. Lack of interoperability (2, 2, 4): Active engagement in NGS, GGF, WLCG, EGEE.
14. Security compromise (3, 3, 9): Work with other e-Infrastructure providers. Limit capability through portals. A key part of user training.
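The scoring scheme (likelihood on a 1-4 scale, impact on a 1/2/3/5 scale, risk = L x I) can be sketched directly; the rows below are a subset of the GridPP3-level risks with the scores quoted on the slide:

```python
# Risk = Likelihood (1-4) x Impact (1, 2, 3 or 5), as used in the register.
risks = [
    ("Insufficient funding",                 3, 5),
    ("Hardware prices don't fall",           2, 3),
    ("Tier-2 market fails",                  2, 3),
    ("Tier-1 misses required service level", 3, 2),
    ("Security compromise",                  3, 3),
]

# Rank by score to see which risks dominate the register.
for name, likelihood, impact in sorted(risks, key=lambda r: -r[1] * r[2]):
    print(f"{name}: {likelihood} x {impact} = {likelihood * impact}")
```

Ranking this subset puts insufficient funding (15) and security compromise (9) ahead of the various score-6 delivery risks, matching the emphasis in the mitigation actions.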
Working Allowance and Contingency

Item               Contingency [£m]   Working Allowance [£m]
Tier-1 Staff       -                  0.478 (d)
Tier-1 Hardware    1.758 (a)          -
Tier-2 Staff       0.853 (b)          0.426 (e)
Tier-2 Hardware    1.537 (c)          -
Management Staff   -                  0.049
Operations         -                  0.300
TOTAL              4.148              1.253

(a) 15% of Tier-1 hardware (cost uncertainties).
(b) 4 FTE at the Tier-2s (market approach).
(c) 15% of Tier-2 hardware (cost uncertainties) + 15% (market approach).
(d) 2 FTE at the Tier-1 (service level).
(e) 2 FTE at the Tier-2s (service level).
Total Project Cost

Share of project cost: Tier-1 50%; Tier-2 25%; Support 13%; Operations 6%; Management 3%; Travel+Other 2%; Outreach 1%.

Work Package   Cost Table                TOTAL [£m]
A              Tier-1 Staff              4.99
               Tier-1 Hardware           11.72
B              Tier-2 Staff              3.29
               Tier-2 Hardware           5.12
C              Grid Support Staff        4.50
D              Grid Operations Staff     1.89
E              Management                1.17
F              Outreach                  0.37
G              Travel and Other          0.84
               Project Sub-Total         33.90
               Working Allowance (4%)    1.25
               Project Cost              35.15
               Contingency (12%)         4.15
               Tier-1 Running Costs      2.50
               Full Approval Cost        41.80
Responses to Referee Questions
Exclusivity?

"There is clearly a compelling advantage for the physicists concerned to be aligned with and pool resources with the rest of the global alliance that comprises LCG. However, this does not need to be an exclusive alliance."

"Long-term operational costs, quality of service and interdisciplinary collaboration could surely be improved by a much more integrated and synergistic approach."

• GridPP has engaged with the wider community (and has reported this to PPARC through the RCUK annual reports).
• GridPP's first Grid application was GEANT-based, for LISA.
• The community is, however, focussed on its scientific priorities: the LHC start-up timescale provides the primary focus.
Outsourcing?

"Companies are developing expertise in service hosting and provision with many opportunities to develop experts, teams, resource management systems and operational/business knowledge."

• GridPP has engaged with BT (visits to the hosting site in St Albans, a meeting with BT management at IC) and has discussed the possibilities fully in the past.
• Recent IT outsourcing exercises at Bristol and Cambridge indicate that the costs are prohibitive (though these may be offset by a joint PR programme).
Novel? Original? Timely?

"Novelty is entirely inappropriate when the goal is a highly reliable, ubiquitous and always available e-Infrastructure."

"Similar undertakings of various scales are underway in many countries."

• GridPP notes that many of the methods used have not been tested at the required scale.

"The LHC is likely to start producing data by 2007 and the proposed e-Infrastructure must be ready by that date if UK PP is to benefit from that data."
Relationships?

"The PP grid community has not yet engaged in collaboration on standardising data replication, data location, caching and reliable data movement services."

• Globus RLS was based on earlier collaboration with EDG, including GridPP input.
• GridPP plans to include higher-level replication services, built on current expertise.
Reliable methods?

"In-house development of middleware and tools is almost certainly inappropriate."

• GridPP agrees and, hence, the focus is on support and maintenance of existing components, with planned reductions in manpower.
• Appendix A2, the Middleware Support Planning Document, expands upon the identified components as being "mission critical" to UK exploitation, part of the UK's input in the wider international context, or components for which leverage can be demonstrated.
Industrial relevance?

"Significant technology transfer depends on long-term and sustained collaboration where mutual understanding develops and co-adaptation follows."

• GridPP agrees: we are proposing a dedicated 0.5 FTE in this area and believe this will represent good value at this level.
Viability?

"There is a significant risk that the gLite stack will prove incapable of development for large scale, wide-spread and production QoS use. It is already very complex."

• GridPP agrees that there is a risk, but the expanded use of gLite across an ever-increasing infrastructure indicates that these problems are being overcome.

"It is better than it was, but it is by no means free from risk and misdirection."
Planning?

"The proposal states that 'A future EGI project, including particle physics as one of the leading applications, may have started'. There are other future scenarios. One is the model already used in GÉANT."

• GridPP agrees that, for example, UKERNA could have been asked to "manage the Grid", but this is not currently planned.
• Our intention is to (continue to) engage fully with the NGS and other bodies, as discussed in appendix A7, the National Context Document.
Planning?

"I would strongly recommend that a production e-Infrastructure project should not use bespoke software."

• GridPP agrees; the reference was to experiment-specific code that is currently necessary to fill gaps in the middleware.

"It is essential to separate all forms of maintenance, especially bug fixing and 'improvements', from operations and to conduct it in a software engineering environment with strict quality controls, testing and release procedures."

• GridPP agrees; the quality controls, testing and release procedures are of a high standard.
Planning?

"It is clear that a production service team should draw on others who should develop such services, not develop them themselves." ...

"It is probably necessary to carry on some aspects of the above work, but these require very careful selection and they should be collaborative with other disciplines and grid projects, and include strategies where the development and maintenance is eventually handed over to others."

• GridPP agrees; in the GridPP3 proposal we discuss a very limited subset of maintenance and support developments that were proven to be necessary (and were effective) in the past, or can be envisaged to be required in future.

c.f. "Storage management is an area where there is already good international collaboration led by the PP community on standards and implementations using the SRM specifications."
Past effectiveness?

"The previous two GridPP projects have taken on demanding and challenging engineering, management and R&D tasks. They have been exceptionally successful, as establishing and running grid services on this scale requires world-leading innovation. This has required professional leadership and adept collaboration. There is plenty of evidence of their ability and the advent of LHC data will guarantee their motivation. Their particular strengths are in service management, deployment and operation on a global scale."

• GridPP agrees.
Suitability

"The two previous GridPP projects have demonstrated that they are capable of recruiting, sustaining and managing such a multi-site team. There is likely to be a substantial carry forward of the GridPP2 team. Can you quantify the level of continuity that the project depends on and the assessment of the risk that this continuity will not be met?"

• GridPP agrees; there is a significant risk that the current expertise will be lost due to planning uncertainty. This was addressed in the proposal by the request for early approval of the GridPP2 continuation component.
Reduce number of Tier-2 sites?
“It might be helpful to review carefully whether long-term savings
can be made by concentrating Tier-2 resources over fewer sites.
Currently table 10 shows 17 sites for Tier-2 resources. Is there
really a case for resources at each of these sites?”
• All institutes have delivered on their past MoU
commitments (past performance was factored into the
proposed sharing of Tier-2 resources)
• If PPARC chose to invest at a small subset of sites, then
significant long-term buildings and infrastructure
investment would be required (that has not been planned)
• In addition, the utility costs of these sites would be exposed (they are
currently hidden)
• If PPARC chose to select a larger subset of sites, there
would be limited gains
“Possibly leveraging SRIF funding is a consideration.”
Cost-effectiveness
““matching funding” is not a justification”
(for 7-month GridPP2 continuation in the
context of EGEE-II)
• The main case is built upon GridPP2 completing its
mission to establish a Production Grid prior to LHC
data-taking
• This enables retention of key staff whilst
planning for the Exploitation phase in
GridPP3
Code efficiency
improvements?
“How do you trade between investing in software
engineering to improve code performance against
investing in more CPU?”
• LHC experiment codes are already highly
optimised for the complex data analysis required
• There is significant investment in the
optimisation effort within the experiments and
the requirements take into account future
optimisations
• The optimisations take account of the (distributed)
Grid computing constraints
Usage increases?
““use by a much larger community intent on individual analyses”
requires further justification. How do you demonstrate this
community will actually engage and actually generate this
additional use?”
• The experiment requirements anticipate increasing
analysis across the experiments
• This is quantified by experiment in the proposal
appendices
2. “ALICE Computing Technical Design Report”, lhcc-2005-018.pdf, 114pp.
3. “ATLAS Computing Technical Design Report”, lhcc-2005-022.pdf, 248pp.
4. “CMS: The Computing Project Technical Design Report”, lhcc-2005-023.pdf, 169pp.
5. “LHCb Computing Technical Design Report”, lhcc-2005-019.pdf, 117pp.
Data Management?
“Companies such as Oracle and IBM supply
well-honed distributed database technologies
capable of high volume and high throughput.
Developing PP-specific and home grown
solutions is very unlikely to be cost effective.”
• Oracle is fully incorporated into LCG
planning, with (low-cost) worldwide Oracle
database services used for core metadata
functions
Tier-2 additional support?
• “Table 12 appears to identify an anomaly that suggests that the plan is not as
cost effective as it should be.”
• Tier-2 support effort is currently cross-subsidised through:
  1. the PP rolling grant programme;
  2. Institute (e.g. computing service) support.
• Component 1 was anticipated not to be viable.
• Component 2 was modest, but is expected to continue at ~this level.
• We have requested Contingency to cover the possibility that
component 2 is not preserved (15% on the hardware cost in addition
to another 15% that covers the future price uncertainty; plus an
additional 4 FTE - 1 at each Tier-2).
• We have also requested Working Allowance of an additional 2 FTE at
Tier-2s, to be used if the service level falls short.
Context Planning?
“The development of this interdependency and
cooperation should be explicitly planned and specified
from the start of GridPP3.” e.g.
““forms part of the National e-Infrastructure” – what
part?”
““CA” LCG uses one system…”
““training” What source of training is this?”
• All plans are integrated with NGS and EGEE in
these areas and expanded upon in appendix A7
National Context Document
Overall Scientific
Assessment
“This proposal is fundable and should be
funded. Because of its significance to an
extensive research community a decision to
proceed should be made quickly.”
• GridPP agrees
• The outline answers provided to the
referees’ questions are provided in
anticipation of such a PPRP decision
Referee 2
• Proposal Details: Reference number: PP/E00296X/1; Grant panel: Projects peer review panel; Grant type: Standard.
• The Proposal: Science quality: I really cannot comment on the pure science, not being a particle physicist. The
proposal itself deals with deploying and operating a production GridPP, and as such is mostly infrastructural
engineering and computer science of a software engineering flavour, rather than pure research. This is as it
should be for a proposal of this type.
• In this sense the proposal is of a high quality. It is of course worthwhile in that it will be impossible for the UK
particle physics community to fully engage with the LHC without GridPP3.
• Objectives: The grand objectives are clear enough in the executive summary; the more detailed objectives are
distributed throughout the proposal, and perhaps could benefit from a summary tabulation. The objectives are
sound but ambitious to an extent that perhaps threatens availability.
• Management: Based on GridPP2, appears to work well.
• Program Plan: Timescales & milestones hard to find.
• Significance: This is a very significant infrastructure for the future of particle physics in the UK.
• c/f Other Work: GridPP has performed very well in the EU context, and also in experimental transatlantic work,
and is a central partner in EGEE. The proposed infrastructure is a part of an overall global grid required for
LHC.
• Methodology: A continuation and expansion from GridPP2, and likely to be successful if the manpower
resources are adequate to the task.
• Industry: Limited proposals.
• Planning: The related planning documents exhibit a good degree of coherency.
• Past Record: The past performance has been good to excellent.
• Suitability: Very suitable.
Project Plan?
“Timescales & milestones hard to find.”
• The intention is to use the project
management methods used (successfully)
in GridPP1 and GridPP2
• The approach taken to GridPP3 planning differs
from that of GridPP1 and GridPP2
• A set of high-level deliverables can be
prepared in the light of PPRP feedback, if
requested
Backup Slides
GridPP2 ProjectMap
[ProjectMap diagram: GridPP2 broken into numbered areas, each carrying numbered
tasks, milestones and metrics: 0 Production Grid milestones and metrics;
1 LCG (1.1 Design, 1.2 Service Challenges, 1.3 Development);
2 M/S/N (2.1 Metadata, 2.2 Storage, 2.3 Workload, 2.4 Security, 2.5 InfoMon, 2.6 Network);
3 LHC Apps (3.1 ATLAS, 3.2 GANGA, 3.3 LHCb, 3.4 CMS, 3.5 PhenoGrid, 3.6 LHC Deployment);
4 Non-LHC Apps (4.1 BaBar, 4.2 SamGrid, 4.3 Portal, 4.4 UKQCD);
5 Management (5.1 Project Planning, 5.2 Project Execution);
6 External (6.1 Dissemination, 6.2 Interoperability, 6.3 Engagement, 6.4 Knowledge Transfer).
A colour-coded legend tracks status (monitor OK / not OK; milestone complete,
overdue, due soon, not due soon; item not active) for tasks due by 31/Dec/05
plus the next 60 days.]
Convergence with NGS
6.2 Interoperability (Owner: Neil Geddes); status date 30-Jun-06

Number  Title                                                             Due        Status
6.2.1   Common GridPP/NGS helpdesk and problem tracking infrastructure    01-Oct-06  Complete
6.2.2   Joint GridPP/NGS plan for web services deployment                 01-Jan-06  In Progress
6.2.3   First jointly supported service                                   01-Jun-05  Complete
6.2.4   Common security policy                                            01-Jan-05  Complete
6.2.5   Integrated plan for Grid support in the UK beyond 2007            01-Apr-06  In Progress
6.2.6   First stage connection of GridPP sites to NGS                     01-Apr-05  In Progress
6.2.7   Second stage connection of GridPP sites to NGS                    01-Apr-06  Not Started
6.2.8   Final stage connection of GridPP sites to NGS                     31-Aug-07  Not Started
6.2.9   Number of NGS representatives on GridPP committees                Ongoing    OK
6.2.10  GridPP attendance at NGS committee meetings                       Ongoing    OK
6.2.11  Number of non-HEP applications tested on GridPP Grid              Ongoing    OK
6.2.12  Number of GridPP members attending GGF meetings                   Ongoing    OK
6.2.13  Number of GridPP members in charge of formal GGF Working Groups   Ongoing    OK
6.2.14  Implemented Common Security Policy                                01-Nov-05  Complete
- The slow emergence of real web-services solutions means that 6.2.2 will probably not be
completed during GridPP2.
- GridPP is committed to gLite; NGS intends to be compatible with this but cannot deploy
the full gLite stack.
- The GridPP collaboration is discussing formal affiliation with NGS: presently Edinburgh is
an NGS affiliate and Oxford, RAL, Manchester, and Lancaster are partners. Discussions are
underway with Glasgow, UCL, and IC.
In the Beginning…
The UK Grid for HEP really started to grow in 2000 with the release of the
Hoffman report into LHC computing requirements and the results of the UK
Government Spending Review (SR2000) which targeted £80m for e-Science.
Academic Application Support Programme (£80m): Research Councils (£74m), DTI (£5m)
- PPARC (£26m)
- BBSRC (£8m)
- MRC (£8m)
- NERC (£7m)
- ESRC (£3m)
- EPSRC (£17m)
- CLRC (£5m)
Generic Challenges: EPSRC (£15m), DTI (£15m)
Collaborative projects: Industrial Collaboration (£40m)
Hardware Costs
[Charts: “Kryder’s Law for disk cost” (Price/TB (£K) and LN(Price/PB)) and
“Moore’s Law for CPU cost” (estimated Price/KSI2K (£K) and LN(Price/MSI2K)),
extrapolated over 2002–2012.]
Hardware costs are extrapolated from recent purchases. However, experience
tells us there are fluctuations associated with technology steps, so there is
significant uncertainty in the integrated cost.
The model must factor in:
- Operational life of equipment
- Known operational overheads
- Lead time for delivery and deployment.
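The extrapolations behind these curves amount to an exponential price decline. A minimal sketch, where the anchor price and halving time are illustrative assumptions rather than the fitted values used in the proposal:

```python
# Moore/Kryder-style price extrapolation: unit price halves every fixed period.
# Anchor price and halving time below are hypothetical, not the proposal's fits.

def unit_price(base_price, base_year, year, halving_time_years):
    """Exponentially declining unit price: halves every `halving_time_years`."""
    return base_price * 0.5 ** ((year - base_year) / halving_time_years)

# Hypothetical anchor: £2k per KSI2K in 2006, halving roughly every 1.5 years.
for year in range(2008, 2012):
    p = unit_price(2.0, 2006, year, 1.5)
    print(f"{year}: £{p:.2f}k per KSI2K")
```

Technology steps (e.g. a delayed density transition) show up as deviations from this smooth curve, which is the source of the integrated-cost uncertainty noted above.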
Hardware Costs: Tape
[Spreadsheet extract, 2007–2012: the Tier-1 tape model. A CAPACITY MODEL tracks
required capacity against actual CASTOR capacity; 9940 media (existing slot
count, media capacity, new slots purchased, spend on new robot infrastructure)
and T10K/T20K media (tapes phased out in March, total tapes and storage
capacity available, additional tapes purchased for April, used slots, media
cost); and spend on media, server bricks, robot M&O, minor parts and drive
maintenance. A BANDWIDTH MODEL compares in-beam and out-of-beam read/write
capacity required (estimated rate to fill over 6 months, double fill rate,
media conversion, reprocessing over ~4 months) against total available CASTOR
bandwidth, assuming 25% drive deadtime on both reads and writes.]
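The capacity side of the tape model reduces to a year-on-year purchase rule: buy enough new tapes each April to cover the required capacity after retiring old media. A minimal sketch of our own, with all figures hypothetical rather than taken from the spreadsheet:

```python
import math

def tapes_to_buy(required_tb, current_tapes, tape_capacity_tb, retired_tapes=0):
    """New tapes needed to reach `required_tb` of capacity, given the stock
    surviving after retirements. All parameters are illustrative."""
    surviving_tb = (current_tapes - retired_tapes) * tape_capacity_tb
    shortfall_tb = max(0.0, required_tb - surviving_tb)
    return math.ceil(shortfall_tb / tape_capacity_tb)

# Hypothetical year: need 2538 TB, hold 4000 tapes of 0.5 TB each, retire 500.
print(tapes_to_buy(2538, 4000, 0.5, 500), "tapes to purchase")
```

The real model adds the constraints the sketch omits: robot slot limits, media generation changes (9940 to T10K/T20K), and the bandwidth side of the plan.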
Running Costs
CPU                 2007     2008     2009     2010
New Systems          166      761      404      473
New Racks              5       24       13       15
Phased-out Racks       4        3        5        0
Rack Count            18       39       47       61
KW/New System       0.26     0.26     0.27     0.29
New KW                 -      198      110      136
Phased-out KW          -       18       51        0
Total Load (KW)      151      330      390      525
Cost Per KW            -  0.00008  0.00008  0.00009
Cost                 £0k    £347k    £430k    £609k

Disk                2007     2008     2009     2010
New Systems          101      201       82      134
New Racks             14       29       12       19
Phased-out Racks       3        4        0       10
Rack Count            32       57       69       78
KW/New System      0.735     0.77     0.81     0.85
New KW                 -      155       66      114
Phased-out KW          -       14        0       49
Total Load (KW)      116      257      323      388
Cost Per KW            -  0.00008  0.00008  0.00009
Cost                 £0k    £270k    £357k    £450k

(Work in progress)
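The cost lines above amount to total electrical load, times hours in the year, times a unit price, plus some allowance for cooling and distribution. A sketch; the overhead factor is an assumption of ours, not a number from the table:

```python
def annual_power_cost_k(total_load_kw, price_per_kwh_pounds, overhead=1.0):
    """Annual electricity cost in £k for a constant load.
    `overhead` (e.g. ~1.5 to allow for cooling) is an illustrative assumption."""
    hours_per_year = 24 * 365
    return total_load_kw * overhead * hours_per_year * price_per_kwh_pounds / 1000.0

# Hypothetical: 330 kW sustained load at 8p/kWh with a 1.5x cooling overhead.
print(f"£{annual_power_cost_k(330, 0.08, 1.5):.0f}k per year")
```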
Tier-1 Growth
                        Now      Start of GridPP3   End of GridPP3
Spinning Disks          ~2000    ~10,000            ~20,000
Yearly disk failures    30-45    200-300?           400-600?
CPU Systems             ~550     ~1800              ~2700
Yearly system failures  35-40    120-130?           180-200?
To achieve the levels of service specified in the MoU, a multi-skilled incident
response unit (3 FTE) is proposed. This is intended to reduce the risk of
over-provisioning other work areas to cope with long-term fluctuations in the
fault rate. These staff will have an expectation that their primary daily role
will be dealing with what has gone wrong. They will also provide the backbone
of the primary callout team.
Tier-2 Allocations
• Take each experiment’s CPU and Disk requirements
(from Dave Newbold)
• For each experiment – share out among Tier-2s
• For each Tier-2 share out among institutes
• Sum over experiments
(maintains the correct CPU/Disk ratio)
Sharing guided by:
• Size of local community (number of Ac/Ph/PP)
• Past delivery (KSI2K to date, Disk usage last quarter)
• Current resources available
Tier-2 ‘Shares’
Tier-2      Physicists     Existing resources      Delivery to date     Disk used       Summary
            (LHC-only      1Q06 (KSI2K, TB)        (KSI2K hrs)          1Q06 (TB)       Min   Max   Ave   Share
            FTEs)
London      40   (26%)     1049.0,  37.7   (27%)   1,348,236   (39%)    17.9   (21%)    21%   39%   28%   ~35%
NorthGrid   33   (22%)     1783.1, 132.2   (48%)   1,229,271   (36%)    34.2   (40%)    22%   48%   36%   ~35%
ScotGrid    14    (9%)      354.0,  44.6   (10%)     187,443    (5%)    21.0   (24%)     5%   24%   12%   ~10%
SouthGrid   66   (43%)      516.4,  48.4   (15%)     661,080   (19%)    13.4   (15%)    15%   43%   20%   ~20%
Total       152            3702.5, 262.9           3,426,030            86.6   (23%)
Example
CMS Requirement in 2008 is 1800 KSI2K and 400 TB

Tier-2 sharing matrix (PMB/Tier-2 Board):
            ATLAS   CMS    LHCb   Other
London      0.25    0.75   0.10   0.30
NorthGrid   0.50    0.00   0.20   0.40
ScotGrid    0.15    0.00   0.30   0.10
SouthGrid   0.10    0.25   0.40   0.20

Institute sharing matrix (Tier-2 Board):
            ATLAS   CMS    LHCb   Other
Brunel      0.00    0.10   0.00   0.15
Imperial    0.00    0.90   1.00   0.00
QMUL        0.70    0.00   0.00   0.60
RHUL        0.20    0.00   0.00   0.15
UCL         0.10    0.00   0.00   0.10

i.e. the Imperial ‘allocation’ is 1800 KSI2K (400 TB) x 0.75 x 0.9 =
1215 KSI2K (270 TB)
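The two-stage sharing in this example can be sketched directly; the shares are the CMS column entries from the slides, and the routine itself is our own illustration, not GridPP's actual allocation tool:

```python
# Sketch of the two-stage Tier-2 allocation worked through above.
# Shares are the CMS columns of the two sharing matrices on this slide.
cms_cpu_ksi2k, cms_disk_tb = 1800.0, 400.0  # CMS requirement in 2008

tier2_cms_share = {"London": 0.75, "NorthGrid": 0.00,
                   "ScotGrid": 0.00, "SouthGrid": 0.25}
london_cms_share = {"Brunel": 0.10, "Imperial": 0.90,
                    "QMUL": 0.00, "RHUL": 0.00, "UCL": 0.00}

def allocation(tier2, institute):
    """Requirement x Tier-2 share x institute share (London institutes only)."""
    frac = tier2_cms_share[tier2] * london_cms_share[institute]
    return cms_cpu_ksi2k * frac, cms_disk_tb * frac

cpu, disk = allocation("London", "Imperial")
print(f"Imperial CMS 'allocation': {cpu:.0f} KSI2K, {disk:.0f} TB")
```

Summing such allocations over all experiments, as described on the Tier-2 Allocations slide, preserves the overall CPU/Disk ratio.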
Crosscheck:
Allocated CPU v ‘Size’
[Scatter plot: each institute’s allocated CPU share (0–25%) plotted against its
share of size/delivery/current resources (0–20%); labelled points include
Imperial, Glasgow, Liverpool and Bristol.]
Tier-2 Staff
Institute     FTE    FTE %
Brunel        0.50     3%
Imperial      1.50    10%
QMUL          1.00     7%
RHUL          0.50     3%
UCL           1.00     7%
Lancaster     1.50    10%
Liverpool     1.00     7%
Manchester    1.50    10%
Sheffield     1.00     7%
Durham        0.25     2%
Edinburgh     0.50     3%
Glasgow       1.00     7%
Birmingham    1.00     7%
Bristol       1.00     7%
Cambridge     0.50     3%
Oxford        0.50     3%
RAL PPD       0.50     3%
Total        14.75   100%

[Scatter plot: “Allocated FTE v CPU”, each institute’s allocated FTE share
(0–12%) plotted against its allocated CPU share (0–25%); labelled points
include Sheffield, Manchester, Imperial, and Birmingham and UCL.]
Proposal Procedure
GridPP1/GridPP2:
Proposal (Tier-1 £Am; Tier-2 £Bm; Middleware £Cm; Applications £Dm; Management £Em; …; Total £Xm)
→ Peer Review
→ Allocate (Institute 1 £Fm; Institute 2 £Gm; Institute 3 £Hm; Institute 4 £Im; Institute 5 £Jm; …; Total £Xm)

GridPP3:
Proposal (Tier-1 £Am; Tier-2 £Bm; Middleware £Cm; Applications £Dm; Management £Em; …; Total £Xm)
→ Peer Review
→ Re-evaluation (Tier-1 £am; Tier-2 £bm; Middleware £cm; Applications £dm; Management £em; …; Total £Ym)
→ Apply for Grants (Institute 1 £fm; Institute 2 £gm; Institute 3 £hm; Institute 4 £im; Institute 5 £jm; …; Total £Ym)
with the question posed at the institute-grant stage: “Is this still a sensible project?”
GridPP3 Deployment Board
In GridPP2, the Deployment Board is squeezed into a space already occupied by
the Tier-2 Board; the D-TEAM; and the PMB. Many meetings have been “joint”
with one of these other bodies. Identity and function have become blurred.
[Organisation diagram: Project Management Board; below it a Deployment Board
(chair X) containing the Tier-1 Board chair, Tier-2 Board chair and Production
Manager; below that the Tier-1 Board, Tier-2 Board and D-Team; a Technical
Coordinator linking to groups Grp-1 … Grp-n.]
In GridPP3, propose a combined Tier-2 Board and Deployment Board with overall
responsibility for deployment strategy to meet the needs of the experiments. In
particular, this is a forum where providers and users formally meet.
Deals with:
1) Issues raised by the Production Manager which require strategic input.
2) Issues raised by users concerning the service provision.
3) Issues to do with Tier-1 - Tier-2 relationships.
4) Issues to do with Tier-2 allocations, service levels, performance.
5) Issues to do with collaboration with Grid Ireland and NGS.
GridPP3 DB Membership
1) Chair
2) Production Manager
3) Technical Coordinator
4) Four Tier-2 Management Board chairs.
5) Tier-1 Board Chair.
6) ATLAS, CMS, LHCb representatives.
7) User Board Chair.
8) Grid Ireland representative
9) NGS representative.
10) Technical people invited for specific issues.
Above list gives ~13 core members, 5 of whom are probably on PMB. There is a
move away from the technical side of the current DB and it becomes a forum
where the deployers meet each other and hear directly from the main users. The
latter is designed to ensure buy-in by the users to strategic decisions.
Grid Data Management
Components:
File transfer services.
Metadata Catalogues.
Services to manage the replication of data.
Operational Support: FTS; metadata catalogues as they are deployed;
replica optimisation services eventually.
Maintenance: Metadata services and eventually replica optimisation
services.
Development: Common metadata services; Replica optimisation.
Storage Management
Components:
DPM (used at 12 Tier-2 sites in UK)
dCache (used at Tier-1 and 7 Tier-2 sites in UK)
CASTOR SRM1 (Tier-1 but to be phased out in 2006)
CASTOR SRM2 (Tier-1 - primary developer).
Operational Support: All above components. Hope to reduce number.
Maintenance: GridPP “owns” dCache installation and configuration scripts
within LCG, and the SRM2 interface to CASTOR.
Development: None envisaged in GridPP3 era. However, SRM version-3 may
impose some requirements
Information and
Monitoring
Components:
R-GMA (information system slated to replace the BDII)
Service Discovery (SD)
APEL accounting (uses R-GMA)
GLUE Schema (information model to define Grid resources)
Operational Support: R-GMA
Maintenance: R-GMA and SD.
Development: R-GMA may still require development at the start of GridPP3. The
GLUE schema is likely to require ongoing development (minor effort).
Workload, Performance
and Portal
Components:
WMS (Resource Broker, Logging & Bookkeeping server etc).
Tools to gather job information (used by ATLAS, CMS, and the RTM).
Real Time Monitor (RTM).
GridPP Portal.
Operational Support: WMS, Job information repository. Job information
analysis.
Maintenance: WMS-testing, Job information scripts, RTM, Portal.
Development: Portal (to address needs of new users);
Job information scripts (to enrich/optimise content);
(Possibly RTM if evolution still required/desired).
Security
Components:
- GridSite Toolkit (includes Grid Access Control Language GACL and
GridSite’s Apache extension mod_gridsite both used by ATLAS and CMS)
- VOMS
Operational Support: GridSite and VOMS. Operational Security Officer Post.
International Security Coordination Post.
Maintenance: GridSite
Development: GridSite
Networking
Components:
- High level contacts with JISC and UKERNA.
- Requirements and provisioning: working with providers on interfaces between
the Grid and network operations.
- Network monitoring and diagnostic tools.
Operational Support: Network monitoring and diagnostics.
Maintenance: Minor.
Development: None.
Active Users (All VOs)
Active Users by LHC experiment
ALICE (8)
ATLAS (70)
CMS (150)
LHCb (40)
Job success?
Overview
Job Success by LHC experiment
ALICE
ATLAS
CMS
LHCb