Transcript Document
GridPP3, David Britton, 6 September 2006

Overview
The GridPP3 proposal consists of a 7-month extension to GridPP2, followed by a three-year GridPP3 project starting in April 2008.
GridPP2+ (7-month extension from September 2007 to March 2008):
- Early approval sought in order to ensure staff retention.
- Provides continuity of management and support over the LHC start-up.
- Aligns the project with (a) the financial year; (b) EGEE and other EU projects.
GridPP3 (3-year project from April 2008 to March 2011):
- "From production to exploitation."
- Delivers large-scale computing resources in a supported environment.
- Underpins the success of the UK contribution to the LHC.

Global Context
[Diagram: timeline 2001-2007 showing EDG, then EGEE-I and EGEE-II, leading to a possible EGI, alongside GridPP1, GridPP2 and GridPP3 and the start of LHC data taking; annotated with developing requirements, evolving standards (LCG, EDG, EGEE), changing costs and budgets, and accumulated experience.]

WLCG MoU
• 17 March 2006: PPARC signed the Memorandum of Understanding with CERN.
• Commitment to the UK Tier-1 at RAL and the four UK Tier-2s to provide services and resources.
• Current MoU signatories: China, France, Germany, Italy, India, Japan, Netherlands, Pakistan, Portugal, Romania, Taiwan, UK, USA.
• Pending signatures: Australia, Belgium, Canada, Czech Republic, Nordic, Poland, Russia, Spain, Switzerland, Ukraine.

Grid Overview
Aim, by 2008 (a full year's data taking):
- CPU ~100 MSI2k (about 100,000 CPUs)
- Storage ~80 PB
- Involving >100 institutes worldwide
- Built on complex middleware from Europe (gLite) and the USA (VDT)
1. Prototype went live in September 2003 in 12 countries.
2. Extensively tested by the LHC experiments in September 2004.
3. 197 sites, 13,797 CPUs, 5 PB storage in September 2005.
4. 177 active sites, 26,527 CPUs, 10 PB storage in September 2006.

Tier-0 to Tier-1
• Worldwide data transfers >950 MB/s sustained for 1 week.
• Peak transfer rate from CERN of >1.6 GB/s.
• Ongoing experiment transfers as part of the current service challenges.

Tier-1 to Tier-2
• UK data transfers >1000 Mb/s sustained for 3 days.
• Peak transfer rate from RAL of >1.5 Gb/s.
• Require high data-rate transfers (300-500 Mb/s) to/from RAL as a routine activity (see the unit sketch below).

It's in use: Active Users by LHC experiment
[Charts of active users per experiment: ALICE (8), ATLAS (70), CMS (150), LHCb (40).]

Tier Centres

Experiment   Tier-0                                Tier-1                                                      Tier-2
ALICE        First-pass scheduled reconstruction   Reconstruction; on-demand analysis                          Central simulation; on-demand analysis
ATLAS        First-pass scheduled reconstruction   Reconstruction; scheduled analysis/skimming; calibration    Simulation; on-demand analysis; calibration
CMS          First-pass scheduled reconstruction   Reconstruction; scheduled analysis/skimming                 Simulation; on-demand analysis; calibration
LHCb         First-pass scheduled reconstruction   Reconstruction; on-demand analysis; scheduled skimming      Simulation
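The transfer figures above mix units: the CERN numbers are in megabytes per second (MB/s, GB/s) while the UK numbers are in megabits per second (Mb/s, Gb/s). A minimal sketch of the conversion, using only the rates quoted in the two transfer slides:

```python
# Unit sanity check for the transfer rates quoted above (illustrative only).
MBIT_TO_MBYTE = 1.0 / 8.0  # 8 bits per byte

def mbit_to_mbyte(rate_mbit_s: float) -> float:
    """Convert megabits/s to megabytes/s."""
    return rate_mbit_s * MBIT_TO_MBYTE

# Tier-0 -> Tier-1: >950 MB/s sustained for one week.
week_s = 7 * 24 * 3600
print(f"950 MB/s for a week moves ~{950 * week_s / 1e6:.0f} TB")  # ~575 TB

# Tier-1 -> Tier-2: quoted in megabits/s.
print(f"1000 Mb/s = {mbit_to_mbyte(1000):.0f} MB/s")              # 125 MB/s
print(f"300-500 Mb/s routine = "
      f"{mbit_to_mbyte(300):.1f}-{mbit_to_mbyte(500):.1f} MB/s")  # 37.5-62.5
```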
LHC Hardware Requirements

                          ALICE     ATLAS     CMS      LHCb
Tier-1 CPU [MSI2k]
  Required                 12.3      24.0     15.2      4.4
  Non-UK pledged            54%       89%      73%      85%
  GridPP3                  0.16      3.00     1.56     0.74
  GridPP3 share              1%       13%      10%      17%
Tier-1 Disk [PB]
  Required                  7.4      14.4      7.0      2.4
  Non-UK pledged            36%       81%      75%      77%
  GridPP3                  0.11      1.78     0.84     0.41
  GridPP3 share              1%       12%      12%      17%
Tier-1 Tape [PB]
  Required                  6.9       9.0     16.7      2.1
  Non-UK pledged            45%       90%      53%      74%
  GridPP3                  0.10      1.12     1.44     0.35
  GridPP3 share              1%       12%       9%      17%
Tier-2 CPU [MSI2k]
  Required                 14.4      19.9     19.3      7.7
  Non-UK pledged            41%       83%      90%      38%
  GridPP3                  0.18      2.66     1.80     2.17
  GridPP3 share              1%       13%       9%      28%
Tier-2 Disk [PB]
  Required                  3.5       8.7      4.9      n/a
  Non-UK pledged            39%       63%      92%      n/a
  GridPP3                  0.05      1.14     0.40     n/a
  GridPP3 share              1%       13%       8%      n/a

Basis of the GridPP3 contributions (these shares are recomputed in the sketch at the end of this section):
- ALICE: based on the UK M&O author fraction (1.2%).
- ATLAS: based on the UK fraction of Tier-1 authors.
- CMS: based on a threshold size for a minimum viable Tier-1.
- LHCb: based on the authorship fraction (16.5%) and the number of Tier-1s.
The overall resource level is reviewed by the LHCC; the balance of CPU, storage and network is driven by the computing models.

Non-LHC Hardware Requirements
BaBar: included explicitly, based on a well-understood resource requirement per fb-1 and the expected luminosity profile up to October 2008. The level is ~15% of Tier-1 CPU and tape, and 9% of disk, in 2008.
UKQCD: request received after the planning stage was completed, so not included in the model. (Some uncertainty remains over whether UKQCD will move to an LCG-based Grid and how the manpower would be funded.) The level is 3-4% of Tier-2 resources and ~7% of Tier-1 tape in 2008.
Others: the requirements of other, smaller user groups, and some provision for future larger groups (LC, Neutrino) whose requirements are currently largely unknown, have been addressed with a 5% envelope allocation of Tier-2 disk and CPU, and of Tier-1 tape.

Budget Overview
[Pie chart: Tier-1 50%, Tier-2 25%, Support 13%, Operations 6%, Management 3%, Travel+Other 2%, Outreach 1%.]

Cost table [£m]:
  Tier-1 Staff                 4.99
  Tier-1 Hardware             11.72
  Tier-2 Staff                 3.29
  Tier-2 Hardware              5.12
  Grid Support Staff           4.50
  Grid Operations Staff        1.89
  Management                   1.17
  Outreach                     0.37
  Travel and Other             0.84
  Project Sub-Total           33.90
  Working Allowance (4%)       1.25
  Project Cost                35.15
  Contingency (12%)            4.15
  Tier-1 Running Costs         2.50
  Full Approval Cost          41.80

Tier-1 Centre
Defined by the experiment hardware requirements, the experiment computing models, a hardware costing model, and the service levels defined in the international MoU signed by PPARC.
[Charts: Tier-1 CPU, disk and tape capacity by experiment (ALICE, ATLAS, BaBar, CMS, LHCb, others) for 2008-2012.]

Estimated Tier-1 peak data flows in 2008 [MB/s]:

        T0->T1   T1->T1 (T1->T0)   T2->T1   T1->T2   CPU->disk   Disk->CPU   Disk->tape   Tape->disk
ATLAS      940      2361              264       34        610         165          25          105
CMS        423      1590              242       45        240         240          58          360
LHCb       212       278               63       54        184         184           3            0
Total     1130      4229              505      133        752         589          86          465
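The GridPP3 share percentages in the requirements table are simply the GridPP3 contribution divided by the total requirement. A minimal sketch for the Tier-1 CPU row, with the numbers copied from the table (not an independent source):

```python
# Recompute the GridPP3 share column of the Tier-1 CPU row [MSI2k].
tier1_cpu = {  # experiment: (required, GridPP3 contribution)
    "ALICE": (12.3, 0.16),
    "ATLAS": (24.0, 3.00),
    "CMS":   (15.2, 1.56),
    "LHCb":  (4.4, 0.74),
}

for expt, (required, gridpp3) in tier1_cpu.items():
    print(f"{expt:5s} {100 * gridpp3 / required:4.1f}%")
# ALICE ~1%, ATLAS ~13%, CMS ~10%, LHCb ~17%, matching the table and,
# for LHCb, the 16.5% authorship fraction quoted above.
```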
Tier-1 Centre: Service Level

Maximum delay in responding to operational problems (for a complete service interruption, for degradation of the capacity of the service by more than 50%, and by more than 20%), and average availability measured on an annual basis:

Service                                         Interruption   >50% degr.   >20% degr.   Avail. (accel. op.)   Avail. (other times)
Acceptance of data from the Tier-0 Centre       12 hours       12 hours     24 hours     99%                   n/a
Networking service to the Tier-0 Centre
  during accelerator operation                  12 hours       24 hours     48 hours     98%                   n/a
Data-intensive analysis services, including
  networking to Tier-0, Tier-1 Centres          24 hours       48 hours     48 hours     98%                   98%
All other services - prime service hours        2 hours        2 hours      4 hours      98%                   98%
All other services - other times                24 hours       48 hours     48 hours     97%                   97%

Tier-1 Centre: Staff

Work Area                 PPARC funding   CCLRC funding
CPU                       2.0             0.0
Disk                      3.0             0.0
Tape and CASTOR           2.0             1.3
Core Services             1.0             0.5
Operations                3.0             1.0
Incident Response Unit    3.0             0.0
Networking                0.0             0.5
Deployment                1.5             0.0
Experiments               1.5             0.0
Tier-1 Management         1.0             0.3
Totals                    18.0            3.6

Core services refers to user file systems, monitoring, software deployment and the conditions database. Operations refers to the machine-room environment, hardware diagnostics/repair, automation, fabric management, tape movement, etc. The Incident Response Unit addresses the MoU service requirement, including out-of-hours call-out.

Tier-2 Centres
GridPP has successfully developed four distributed Tier-2 Centres which have:
- Engaged the institutes;
- Leveraged large amounts of resources;
- Developed local expertise;
- Stimulated cross-disciplinary relationships;
- Helped promote the Grid, GridPP, particle physics, and the local groups within the universities.
Successes: development of a regional management structure; an MoU signed by each institute with GridPP; deployment of complex middleware; accounting; security; data transfers; all fully operational and contributing to LCG.

Tier-2 Centres
To match the LHC computing models, around 50% of the UK computing resources will be located at the Tier-2s. Service levels are not as demanding as at the Tier-1 (downtime equivalents are sketched below):

Service                      Max delay (prime time)   Max delay (other periods)   Average availability (annual)
End-user analysis facility   2 hours                  72 hours                    95%
Other services               12 hours                 72 hours                    95%

The distributed nature of the UK Tier-2s has technical advantages ("divide and conquer") and technical drawbacks ("inefficiencies"). The importance of political/social aspects should not be underestimated.
[Charts: Tier-2 total CPU and disk capacity by experiment (ALICE, ATLAS, BaBar, CMS, LHCb, others, UKQCD) for 2008-2012.]

Tier-2 Market Model
1) Assume all institutes involved are interested in building on their current contribution, so that...
2) ...effectively a "market" exists to provide Tier-2 resources to HEP (because many institutes have dual-funding opportunities and/or internal reasons to be involved).
3) GridPP offers a market price for Tier-2 resources which institutes may or may not choose to accept.
4) The market price is adjusted to optimise the resources obtained.
5) The market price is bounded by what it would cost to provision the resources at the Tier-1.
Inefficiencies associated with the distributed nature of the Tier-2s may be balanced by an increase in competition/leverage.
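For scale, the availability targets in the two service-level tables above translate into allowed downtime as follows; a minimal sketch:

```python
# Convert an annual availability target into allowed hours of downtime.
HOURS_PER_YEAR = 365 * 24  # 8760

def allowed_downtime_h(availability_pct: float) -> float:
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for pct in (99, 98, 97, 95):
    print(f"{pct}% availability -> ~{allowed_downtime_h(pct):.0f} h/year")
# 99% -> 88 h, 98% -> 175 h, 97% -> 263 h, 95% (Tier-2) -> 438 h.
```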
Tier-2 Hardware Allocations
Constrained by the requirement for institutional JeS forms, GridPP made an initial mapping (an allocation, i.e. not quite the "market" approach intended) of Tier-2 hardware.

Fraction of each experiment allocated to each Tier-2:

            ATLAS   CMS    LHCb   Other
London      0.25    0.75   0.10   0.30
NorthGrid   0.50    0.00   0.20   0.40
ScotGrid    0.15    0.00   0.30   0.10
SouthGrid   0.10    0.25   0.40   0.20

Relative fraction of each experiment allocated to each institute within the Tier-2:

London       ATLAS   CMS    LHCb   Other
Brunel       0.00    0.10   0.00   0.15
Imperial     0.00    0.90   1.00   0.00
QMUL         0.70    0.00   0.00   0.60
RHUL         0.20    0.00   0.00   0.15
UCL          0.10    0.00   0.00   0.10

NorthGrid    ATLAS   CMS    LHCb   Other
Lancaster    0.50    0.00   0.00   0.50
Liverpool    0.20    0.00   1.00   0.15
Manchester   0.20    0.00   0.00   0.20
Sheffield    0.10    0.00   0.00   0.15

ScotGrid     ATLAS   CMS    LHCb   Other
Durham       0.10    0.00   0.10   0.25
Edinburgh    0.00    0.00   0.40   0.25
Glasgow      0.90    0.00   0.50   0.50

SouthGrid    ATLAS   CMS    LHCb   Other
Birmingham   0.40    0.00   0.00   0.30
Bristol      0.00    0.50   0.25   0.20
Cambridge    0.25    0.00   0.25   0.15
Oxford       0.25    0.00   0.25   0.25
RAL PPD      0.10    0.50   0.25   0.10

Allocations are based on past delivery, current size, and the size of the local community of physicists.

Tier-2 Staff Allocations
GridPP currently funds 9 FTE at 17 institutes. In GridPP3 this is proposed to increase to 14.75 FTE (c.f. the Tier-1, which has 18 FTE funded by GridPP3 for a comparable amount of hardware). Again, in the market approach this is the "effort (currently) offered" and not an estimate of the "full effort needed".

London      FTE     NorthGrid    FTE     ScotGrid    FTE     SouthGrid    FTE
Brunel      0.50    Lancaster    1.50    Durham      0.25    Birmingham   1.00
Imperial    1.50    Liverpool    1.00    Edinburgh   0.50    Bristol      0.50
QMUL        1.00    Manchester   1.50    Glasgow     1.00    Cambridge    1.00
RHUL        0.50    Sheffield    1.00                        Oxford       0.50
UCL         1.00                                             RAL PPD      0.50
Total       4.50                 5.00                1.75                 3.50

Tier-2 Hardware Costs (agreed by the CB)

                           2007      2008      2009      2010      2011      2012
Requirement (KSI2K)         -        7560     10215     14522     18203     21708
Amount paid for            1559      2106      2994      3753      4476      -
Unit cost (£k/KSI2K)       0.392     0.312     0.247     0.175     0.124     0.087
Cost                       £612k     £656k     £740k     £656k     £553k     £0k
Total (inc. disk)          £1,163k   £1,295k   £1,383k   £1,282k   £1,120k   £0k

• Take the requirement in the following year (7560) divided by the lifetime in years (4.85 for CPU, 3.9 for disk) = 1559.
• Multiply by the unit cost in that year (£0.392k/KSI2K) = £612k.
• Similarly for disk.
• It is up to the institutes how they spend it (new kit, replacement kit, central services, ...).
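A sketch of the rolling-replacement arithmetic in the Tier-2 hardware cost table above; the function names are ours, the lifetimes and 2007 figures are the slide's:

```python
# Each year GridPP pays for 1/lifetime of the *following* year's requirement,
# at that year's unit price.
CPU_LIFETIME_Y = 4.85     # years; disk uses 3.9
UNIT_COST_2007_K = 0.392  # £k per KSI2K in 2007

def annual_purchase(next_year_requirement: float, lifetime_y: float) -> float:
    """Capacity (KSI2K or TB) to buy this year."""
    return next_year_requirement / lifetime_y

bought = annual_purchase(7560, CPU_LIFETIME_Y)  # 2008 requirement, bought 2007
print(f"Buy {bought:.0f} KSI2K in 2007 for ~£{bought * UNIT_COST_2007_K:.0f}k")
# -> 1559 KSI2K, ~£611k (the slide rounds to £612k).
```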
Tier-2 Resources
Sanity checks:
1) Compare the cost to GridPP of hardware at the Tier-1 and Tier-2, integrated over the lifetime of the project:

                        Tier-1    Tier-2
CPU (K£/KSI2K-year)     0.070     0.045
Disk (K£/TB-year)       0.144     0.109
Tape (K£/TB-year)       0.052     -

2) Total cost to project: comparing the (staff + hardware) cost of the Tier-2 facilities with the cost to the project of placing the same hardware at the Tier-1 (assuming that doubling the Tier-1 hardware requires a 35% increase in staff). Including staff and hardware, the cost of the Tier-2 facilities is ~80% of the cost of an enlarged Tier-1.

Budget Overview
[Repeat of the Budget Overview slide: pie chart and cost table as shown earlier.]

Grid Support
Refers to staff effort for the support of the Middleware, Security and Networking (MSN) areas in GridPP3. The emphasis is on a managed transition from middleware development to middleware support (operational and bug-fixing).
Three criteria were applied to guide the prioritisation of areas for support:
1) Areas which are "mission critical" for the UK.
2) Areas which are viewed as established "international obligations".
3) Areas which provide significant leverage to the obvious advantage of GridPP.
The background documents discuss the areas in terms of:
a) Operational support;
b) Maintenance (bug-fixing);
c) Development (phased out where practical).

Grid Support Areas

Area                             Role                                   FY08   FY09   FY10
Grid Data Management             Operational Service Support            1.0    1.0    1.0
                                 Metadata                               1.0    1.0    1.0
                                 Replica Management                     1.0    1.0    1.0
Storage                          Castor Support                         1.0    0.5    0.5
                                 DPM Support                            1.0    1.0    0.5
                                 dCache Support                         1.0    1.0    1.0
Information & Monitoring         Operational Service Support            1.0    1.0    1.0
                                 R-GMA & Support                        3.0    1.5    0.6
                                 Service Discovery Support              0.5    0.3    0.2
                                 GLUE & International Collaboration     0.5    0.2    0.2
Workload, Performance & Portal   Operational Service Support            1.0    0.5    0.5
                                 Performance Testing                    0.5    0.5    0.5
                                 Real-Time Monitoring                   0.5    0.5    0.5
                                 Portal Support                         0.5    0.5    0.5
Security                         Operational Security Officer           1.0    1.0    1.0
                                 GridSite Support                       1.5    1.5    1.5
                                 VOMS Support                           0.5    0.5    0.5
                                 International Security Co-ordination   0.8    0.8    0.8
Networking                       Requirements & Provisioning            0.5    0.5    0.5
                                 Performance Monitoring                 0.5    0.5    0.5
TOTALS                                                                 18.3   15.3   13.8

(The area subtotals and grand totals are cross-checked in the sketch at the end of this section.)

Grid Support Staff Evolution
GridPP2 funds 24.5 FTE across WLMS, Portal, Data Management, Data Storage, Security, InfoMon and Network posts plus an HP post, jointly funded by GridPP and EGEE (including Tier-2 expert, MSN and applications-interface roles); this becomes 24.0 FTE in GridPP2+. In GridPP3 the effort ramps down as follows:

GridPP3 area                       FY08   FY09   FY10
Workload, Performance and Portal    2.5    2.0    2.0
Data Management                     3.0    3.0    3.0
Storage                             3.0    2.5    2.0
Security                            3.8    3.8    3.8
InfoMon                             5.0    3.0    2.0
Networking                          1.0    1.0    1.0
GRAND TOTAL                        18.3   15.3   13.8

Grid Operations
A team of 8.5 FTE consisting of:
- 1 Production Manager;
- 4 Tier-2 Coordinators;
- 3 to run the UK/GridPP Grid Operations Centre (GOC);
- 0.5 FTE to coordinate technical documentation.
Responsible for the deployment, operation and support of the UK particle physics Grid environment. The Production Manager is responsible for resolving technical and coordination issues that span the Tier-1 and Tier-2s, and for ensuring a stable production service with appropriate upgrades to improve functionality and quality. The current GOC (5.5 FTE funded by EGEE) is responsible for monitoring worldwide Grid operations and for providing trouble tickets, accounting services and administrative tools.
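As a cross-check, summing the per-area effort in the Grid Support Areas table reproduces the headline totals; a minimal sketch using the area subtotals as tabulated above:

```python
# Per-area Grid Support FTEs (FY08, FY09, FY10), from the table above.
support_fte = {
    "Grid Data Management":           (3.0, 3.0, 3.0),
    "Storage":                        (3.0, 2.5, 2.0),
    "Information & Monitoring":       (5.0, 3.0, 2.0),
    "Workload, Performance & Portal": (2.5, 2.0, 2.0),
    "Security":                       (3.8, 3.8, 3.8),
    "Networking":                     (1.0, 1.0, 1.0),
}

for label, idx in (("FY08", 0), ("FY09", 1), ("FY10", 2)):
    total = sum(v[idx] for v in support_fte.values())
    print(f"{label}: {total:.1f} FTE")  # 18.3, 15.3, 13.8
```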
Operations Posts

Area              Role                                                      FY08   FY09   FY10
Grid Deployment   Production Manager                                        1.0    1.0    1.0
                  Tier-2 Technical Coordinators (one per regional centre)   4.0    4.0    4.0
                  Technical Documentation                                   0.5    0.5    0.5
Grid Operations   Monitoring of LCG operations in the UK                    1.0    1.0    1.0
                  Grid Accounting                                           1.0    1.0    1.0
                  International Coordination                                0.5    0.5    0.5
                  Security Risk Management                                  0.5    0.5    0.5
TOTALS                                                                      8.5    8.5    8.5

Budget Overview
[Repeat of the Budget Overview slide: pie chart and cost table as shown earlier.]

GridPP3 Structure
[Diagram: GridPP3 management structure. An Oversight Committee (OC) and Collaboration Board (CB) sit above the Project Management Board (PMB), which reviews and reacts; the Deployment Board (DB) covers provision and the User Board (UB) covers utilisation.]

Management Continuity

Post (GridPP2)            GridPP2   GridPP2+   FY08   FY09   FY10   Post (GridPP3)
Project Leader            0.67      0.67       0.90   0.90   0.90   Project Leader
Project Manager           0.90      0.90       1.00   1.00   1.00   Project Manager
T2 Coordinator            0.50      0.50       -      -      -      -
DB Chair                  0.30      0.30       0.40   0.40   0.40   Deployment Coordinator
UB Chair                  0.00      0.00       0.25   0.25   0.25   UB Chair
Middleware Coordinator    0.50      0.50       -      -      -      -
Application Coordinator   0.50      0.50       0.40   0.40   0.40   Technical Coordinator
CCLRC Management          0.50      0.50       0.50   0.50   0.50   CCLRC Management
Sub-Total                 3.87      3.87       3.45   3.45   3.45

Outreach
Currently there is a Dissemination Officer and an Events Officer (1.5 FTE). The instructions in the PPARC call include the statement: "It is expected that a plan for collaboration with industry will be presented or justification if such a plan is not appropriate."
Therefore, the mandate is broadened to include industrial liaison without increasing manpower, but 0.5 FTE is added to this area from the current documentation officer to handle user documentation and web-site maintenance. Overall, a team of 2 FTE is responsible for:
- Dissemination activities (news, press releases, liaison with partners, etc.);
- Event organisation (demos, publicity, etc.);
- Industrial liaison (to be developed);
- Basic user documentation and website maintenance.

GridPP3 Posts [FTE]

Area            Posts                     GridPP2          GridPP2+         FY08    FY09    FY10
                                          (GridPP/EGEE)    (GridPP/EGEE)
Management      All management posts       3.87 / -         3.87 / -         3.45    3.45    3.45
Tier-1          All Tier-1 services       13.50 / 1.50     16.00 / 1.50     18.00   18.00   18.00
Tier-2          Hardware support           9.00 / 3.50      9.00 / 3.50     14.75   14.75   14.75
                Specialist posts           5.50 / -         5.00 / -         -       -       -
Middleware      All MSN posts             13.00 / -        13.00 / -        18.30   15.30   13.80
Applications    All application posts     18.50 / -         1.00 / -         -       -       -
Operations      Operations Manager         1.00 / -         1.00 / -         1.00    1.00    1.00
                Tier-2 Coordinators        0.00 / 4.00      0.00 / 4.00      4.00    4.00    4.00
                GOC posts                  0.00 / 5.50      0.00 / 5.50      3.00    3.00    3.00
                Technical documentation    -                -                0.50    0.50    0.50
Documentation   Documentation Officer      1.00 / -         1.00 / -         0.50    0.50    0.50
Dissemination   Dissemination + Events     1.50 / -         1.50 / -         1.50    1.50    1.50
TOTAL                                     81.37            65.87            65.00   62.00   60.50

Travel and Other Costs
Based on experience in GridPP2, we have budgeted £3.5k per FTE per annum for travel (a reduction of about 10%) to cover collaboration meetings, national and international conferences and workshops, technical meetings, management meetings, etc. "Other costs" of £15k per annum have been included for outreach expenses and other operational expenses (licences, laptops, test machines, web server, software, etc.).
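The travel-and-other rule just stated can be checked against work package G of the Total Costs table that follows; a minimal sketch, with the FTE counts taken from the GridPP3 Posts table:

```python
# Travel at £3.5k per FTE per year, plus a flat £15k of "other costs".
TRAVEL_PER_FTE_K = 3.5
OTHER_K = 15.0

fte_totals = {"FY08": 65.0, "FY09": 62.0, "FY10": 60.5}

for year, fte in fte_totals.items():
    print(f"{year}: £{fte * TRAVEL_PER_FTE_K + OTHER_K:.2f}k")
# FY08 £242.50k, FY09 £232.00k, FY10 £226.75k, matching row G below.
```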
Total Costs [k£]

Work Package                    FY07      FY08      FY09      FY10      Total
A  Tier-1 Staff                693.47   1384.21   1432.66   1482.80    4993
   Tier-1 Hardware            3810.50   2621.22   3025.76   2265.54   11723
B  Tier-2 Staff                147.84   1008.59   1047.87   1088.90    3293
   Tier-2 Hardware            1163.24   1294.63   1382.60   1281.84    5122
C  Support                     695.81   1416.04   1232.93   1155.33    4500
D  Operations                   43.34    592.72    614.84    637.85    1889
E  Management                  194.35    311.78    324.97    338.78    1170
F  Outreach and Other           59.97     99.72    103.58    107.61     371
G  Travel Costs                134.48    242.50    232.00    226.75     836
   Total                      6943.00   8971.41   9397.21   8585.39   33897

Risks
The GridPP2 risk register covers: R1 Recruitment/retention difficulties; R2 Sudden loss of key staff; R3 Minimal contingency; R4 GridPP deliverables late; R5 Sub-components not delivered to project; R6 Non take-up of project results; R7 Change in project scope; R8 Bad publicity; R9 External OS dependence; R10 External middleware dependence; R11 Lack of monitoring of staff; R12 Withdrawal of an experiment; R13 Lack of cooperation between Tier centres; R14 Scalability problems; R15 Software maintainability problems; R16 Technology shifts; R17 Repetition of research; R18 Lack of funding to meet LCG PH-1 goals; R20 Conflicting software requirements; R22 Hardware resources inadequate; R25 Hardware procurement problems; R26 LAN bottlenecks; R27 Tier-2 organisation fails; R28 Experiment requirements not met; R29 SYSMAN effort inadequate; R30 Firewalls interfere with Grid; R31 Inability to establish trust relationships; R32 Security inadequate to operate Grid; R33 Interoperability; R35 Failure of international cooperation; R36 e-Science and GridPP divergence; R37 Institutes do not embrace Grid; R38 Grid does not work as required; R39 Delay of the LHC; R40 Lack of future funding; R41 Network backbone failure; R42 Network backbone bottleneck; R43 Network backbone upgrade delay; R44 Inadequate user support.
[Table: per-risk likelihood, impact and risk scores for the GridPP, LCG, MSN, Apps and Production Grid areas.]

The GridPP3 risk register (likelihood 1-4; impact 1, 2, 3 or 5; risk = L x I):

#   Name                                            L   I   Risk   Action to mitigate risk
1   Insufficient funding                            3   5   15     Present requirements. PPRP to advise on strategic priorities.
2   Hardware costing (prices do not fall
    as anticipated)                                 2   3    6     Delay if possible, or de-scope if necessary.
3   Tier-2 market fails                             2   3    6     Increase the Tier-2 hardware price and/or Tier-2 staffing level.
4   Tier-1 fails to meet required service level     3   2    6     Increase the Tier-1 staffing level.
5   Tier-2s fail to meet required service level     3   2    6     Increase the Tier-2 staffing level.
6   Middleware fails                                2   3    6     Mitigated by experiment-specific solutions. Work with partners to address shortcomings. Re-target support effort.
7   Industrial take-up low                          3   1    3     Facilitated by the Industrial Liaison post.
8   Outreach fails                                  1   2    2     Appoint a Dissemination Officer.
9   Staffing missing/unqualified                    1   3    3     Build on existing expertise. The likelihood is assumed low given early approval of the GridPP2 extension.
10  Organisational problems                         1   3    3     Define/build/agree the GridPP3 structure. Clarify the role of GridPP3 and its interactions.
11  Technical risks                                 *   *    *     See GridPP2 risks R9, R10, R13, R14, R16, R22, R25, R27, R36; also physical risks. Develop a full GridPP3 risk register based on that from GridPP2. Adopt a conservative approach to technology deployment.
12  Inadequate support infrastructure               2   2    4     Monitor the performance of support activities via pre-defined metrics.
13  Lack of interoperability                        2   2    4     Active engagement in NGS, GGF, WLCG, EGEE.
14  Security compromise                             3   3    9     Work with other e-Infrastructure providers. Limit capability through portals. Key part of user training.
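The register above scores each risk as likelihood (1-4) times impact (1, 2, 3 or 5). A minimal sketch ranking a few of the rows:

```python
# Risk exposure = likelihood x impact, as in the GridPP3 risk table.
risks = [  # (name, likelihood, impact)
    ("Insufficient funding",        3, 5),
    ("Security compromise",         3, 3),
    ("Hardware costing",            2, 3),
    ("Tier-1 misses service level", 3, 2),
    ("Outreach fails",              1, 2),
]

for name, li, im in sorted(risks, key=lambda r: r[1] * r[2], reverse=True):
    print(f"{name:28s} L={li} I={im} exposure={li * im}")
# Insufficient funding tops the register at 3 x 5 = 15.
```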
Working Allowance and Contingency

Item               Contingency [£m]   Working Allowance [£m]
Tier-1 Staff       -                  0.478 (d)
Tier-1 Hardware    1.758 (a)          -
Tier-2 Staff       0.853 (b)          0.426 (e)
Tier-2 Hardware    1.537 (c)          -
Management Staff   -                  0.049
Operations         -                  0.300
TOTAL              4.148              1.253

(a) 15% of Tier-1 hardware (cost uncertainties).
(b) 4 FTE at the Tier-2s (market approach).
(c) 15% of Tier-2 hardware (cost uncertainties) + 15% (market approach).
(d) 2 FTE at the Tier-1 (service level).
(e) 2 FTE at the Tier-2s (service level).

Total Project Cost
[Repeat of the Budget Overview pie chart and cost table, with the work packages labelled: A Tier-1 (staff and hardware), B Tier-2 (staff and hardware), C Grid Support staff, D Grid Operations staff, E Management, F Outreach, G Travel and Other. Project Cost £35.15m; Full Approval Cost £41.80m.]

Responses to Referee Questions

Exclusivity?
"There is clearly a compelling advantage for the physicists concerned to be aligned with and pool resources with the rest of the global alliance that comprises LCG. However, this does not need to be an exclusive alliance."
"long-term operational costs, quality of service and interdisciplinary collaboration could surely be improved by a much more integrated and synergistic approach."
• GridPP has engaged with the wider community (and has reported this to PPARC through RCUK annual reports).
• GridPP's first Grid application was GEANT-based, for LISA.
• The community is, however, focussed on its scientific priorities: the LHC start-up timescale provides the primary focus.

Outsourcing?
"companies are developing expertise in service hosting and provision with many opportunities to develop experts, teams, resource management systems and operational/business knowledge."
• GridPP has engaged with BT (visits to the hosting site in St Albans, a meeting with BT management at IC) and discussed the possibilities fully in the past.
• Recent IT outsourcing exercises at Bristol and Cambridge indicate that the costs are prohibitive (though these may be offset by a joint PR programme).

Novel? Original? Timely?
"novelty is entirely inappropriate when the goal is a highly reliable, ubiquitous and always available eInfrastructure"
"similar undertakings of various scales are underway in many countries"
• GridPP notes that many of the methods used have not been tested at the required scale.
"The LHC is likely to start producing data by 2007 and the proposed e-Infrastructure must be ready by that date if UK PP is to benefit from that data."
Relationships?
"the PP grid community has not yet engaged in collaboration on standardising data replication, data location, caching and reliable data movement services."
• Globus RLS was based on earlier collaboration with EDG, including GridPP input.
• GridPP plans to include higher-level replication services, built on current expertise.

Reliable methods?
"In house development of middleware and tools is almost certainly inappropriate"
• GridPP agrees and, hence, the focus is on support and maintenance of existing components, with planned reductions in manpower.
• Appendix A2, the Middleware Support Planning Document, classifies the identified components as "mission critical" to UK exploitation, as part of the UK's input in the wider international context, or as areas where leverage can be demonstrated.

Industrial relevance?
"significant technology transfer depends on long-term and sustained collaboration where mutual understanding develops and co-adaptation follows"
• GridPP agrees: we are proposing a dedicated 0.5 FTE in this area and believe this will represent good value at this level.

Viability?
"There is a significant risk that the gLite stack will prove incapable of development for large scale, wide-spread and production QoS use. It is already very complex.."
• GridPP agrees that there is a risk, but the expanded use of gLite across an ever-increasing infrastructure indicates that these problems are being overcome.
"It is better than it was but it by no means free from risk and misdirection."

Planning?
"The proposal states that "A future EGI project, including particle physics as one of the leading applications, may have started". There are other future scenarios. One is the model already used in GÉANT.."
• GridPP agrees that, for example, UKERNA could have been asked to "manage the Grid", but this is not currently planned.
• Our intention is to (continue to) engage fully with the NGS and other bodies, as discussed in appendix A7, the National Context Document.

Planning?
"I would strongly recommend that a production eInfrastructure project should not use bespoke software."
• GridPP agrees; the reference was to experiment-specific code that is currently necessary to fill gaps in the middleware.
"It is essential to separate all forms of maintenance, especially bug fixing and "improvements" from operations and to conduct it in a software engineering environment with strict quality controls, testing and release procedures."
• GridPP agrees; the quality controls, testing and release procedures are of a high standard.

Planning?
"It is clear that a production service team should draw on others who should develop such services, not develop them themselves." ... "It is probably necessary to carry on some aspects of the above work, but these require very careful selection and they should be collaborative with other disciplines and grid projects, and include strategies where the development and maintenance is eventually handed over to others."
• GridPP agrees; in the GridPP3 proposal we discuss a very limited subset of maintenance and support developments that were proven to be necessary (and were effective) in the past, or can be envisaged to be required in future; c.f.
"Storage management is an area where there is already good international collaboration led by the PP community on standards and implementations using the SRM specifications"

Past effectiveness?
"The previous two GridPP projects have taken on demanding and challenging engineering, management and R&D tasks. They have been exceptionally successful, as establishing and running grid services on this scale requires world-leading innovation. This has required professional leadership and adept collaboration. There is plenty of evidence of their ability and the advent of LHC data will guarantee their motivation. Their particular strengths are in service management, deployment and operation on a global scale."
• GridPP agrees.

Suitability
"The two previous GridPP projects have demonstrated that they are capable of recruiting, sustaining and managing such a multi-site team. There is likely to be a substantial carry forward of the GridPP2 team. Can you quantify the level of continuity that the project depends on and the assessment of the risk that this continuity will not be met?"
• GridPP agrees; there is a significant risk that the current expertise will be lost owing to planning uncertainty. This was addressed in the proposal by the request for early approval of the GridPP2 continuation component.

Reduce number of Tier-2 sites?
"It might be helpful to review carefully whether long-term savings can be made by concentrating Tier-2 resources over fewer sites. Currently table 10 shows 17 sites for Tier-2 resources. Is there really a case for resources at each of these sites?"
• All institutes have delivered on their past MoU commitments (past performance was factored into the proposed sharing of Tier-2 resources).
• If PPARC chose to invest at a small subset of sites, significant long-term buildings and infrastructure investment would be required (which has not been planned).
• In addition, the utility costs of these sites would become exposed (they are currently hidden).
• If PPARC chose to select a larger subset of sites, there would be limited gains.
"Possibly leveraging SRIF funding is a consideration."

Cost-effectiveness
""matching funding" is not a justification" (for the 7-month GridPP2 continuation in the context of EGEE-II)
• The main case is built upon GridPP2 completing its mission to establish a production Grid prior to LHC data-taking mode.
• This enables retention of key staff whilst planning for the exploitation phase in GridPP3.

Code efficiency improvements?
"How do you trade between investing in software engineering to improve code performance against investing in more CPU?"
• LHC experiment codes are already highly optimised for the complex data analysis required.
• There is significant investment in the optimisation effort within the experiments, and the requirements take future optimisations into account.
• The optimisations take account of the (distributed) Grid computing constraints.

Usage increases?
""use by a much larger community intent on individual analyses" requires further justification. How do you demonstrate this community will actually engage and actually generate this additional use?"
• The experiment requirements anticipate increasing analysis across the experiments.
• This is quantified, experiment by experiment, in the proposal appendices:
2. "ALICE Computing Technical Design Report", lhcc-2005-018.pdf, 114pp.
3. "ATLAS Computing Technical Design Report", lhcc-2005-022.pdf, 248pp.
4. "CMS: The Computing Project Technical Design Report", lhcc-2005-023.pdf, 169pp.
5. "LHCb Computing Technical Design Report", lhcc-2005-019.pdf, 117pp.

Data Management?
"Companies such as Oracle and IBM supply well-honed distributed database technologies capable of high volume and high throughput. Developing PP-specific and home grown solutions is very unlikely to be cost effective."
• Oracle is fully incorporated into LCG planning, with (low-cost) worldwide Oracle database services used for core metadata functions.

Tier-2 additional support?
"Table 12 appears to identify an anomaly that suggests that the plan is not as cost effective as it should be."
• Tier-2 support effort is currently cross-subsidised through:
  1. the PP rolling grant programme;
  2. institute (e.g. computing service) support.
• Component 1 was anticipated not to remain viable.
• Component 2 was modest, but is expected to continue at roughly this level.
• We have requested Contingency to cover the possibility that component 2 is not preserved (15% on the hardware cost, in addition to another 15% that covers the future price uncertainty, plus an additional 4 FTE, one at each Tier-2).
• We have also requested a Working Allowance of an additional 2 FTE at the Tier-2s, to be used if the service level falls short.

Context Planning?
"The development of this interdependency and cooperation should be explicitly planned and specified from the start of GridPP3." e.g. ""forms part of the National e-Infrastructure" – what part?" ""CA" LCG uses one system…" ""training" What source of training is this?"
• All plans are integrated with NGS and EGEE in these areas, and are expanded upon in appendix A7, the National Context Document.

Overall Scientific Assessment
"This proposal is fundable and should be funded. Because of its significance to an extensive research community a decision to proceed should be made quickly."
• GridPP agrees.
• The outline answers to the referees' questions are provided in anticipation of such a PPRP decision.

Referee 2
Proposal details: reference number PP/E00296X/1; grant panel: Projects peer review panel; grant type: Standard.
The proposal:
• Science quality: I really cannot comment on the pure science, not being a particle physicist. The proposal itself deals with deploying and operating a production GridPP, and as such is mostly infrastructural engineering and computer science of a software engineering flavour, rather than pure research. This is as it should be for a proposal of this type. In this sense the proposal is of a high quality. It is of course worthwhile in that it will be impossible for the UK particle physics community to fully engage with the LHC without GridPP3.
• Objectives: The grand objectives are clear enough in the executive summary; the more detailed objectives are distributed throughout the proposal, and perhaps could benefit from a summary tabulation. The objectives are sound but ambitious to an extent that perhaps threatens availability.
• Management: Based on GridPP2, appears to work well.
• Program plan: Timescales & milestones hard to find.
• Significance: This is a very significant infrastructure for the future of particle physics in the UK.
• c.f. other work: GridPP has performed very well in the EU context, and also in experimental transatlantic work, and is a central partner in EGEE. The proposed infrastructure is part of an overall global grid required for the LHC.
• Methodology: A continuation and expansion from GridPP2, and likely to be successful if the manpower resources are adequate to the task.
• Industry: Limited proposals.
• Planning: The related planning documents exhibit a good degree of coherency.
• Past record: The past performance has been good to excellent.
• Suitability: Very suitable.

Project Plan?
"Timescales & milestones hard to find."
• The intention is to use the project management methods used (successfully) in GridPP1 and GridPP2.
• The approach taken to GridPP3 planning is different from that of GridPP1 and GridPP2.
• A set of high-level deliverables can be prepared in the light of PPRP feedback, if requested.

Backup Slides

GridPP2 ProjectMap
[Slide: the GridPP2 ProjectMap dashboard, a grid of numbered milestones, metrics and tasks for the project areas 1 LCG (Design, Service Challenges, Development), 2 M/S/N (Metadata, Storage, Workload, Security, InfoMon, Network), 3 LHC Apps (ATLAS, GANGA, LHCb, CMS, PhenoGrid, Deployment), 4 Non-LHC Apps (BaBar, SamGrid, Portal, UKQCD), 5 Management (Project Planning, Project Execution) and 6 External (Dissemination, Interoperability, Engagement, Knowledge Transfer), plus Production Grid milestones and metrics. Status shown for 31/Dec/05 and the following 60 days, including 88 metrics OK (91%), 9 metrics not OK, 103 tasks complete, and 37 change forms.]
Convergence with NGS (ProjectMap item 6.2, Interoperability; owner: Neil Geddes; status date 30-Jun-06)

Number  Title                                                              Due        Status
6.2.1   Common GridPP/NGS helpdesk and problem tracking infrastructure    01-Oct-06  Complete
6.2.2   Joint GridPP/NGS plan for web services deployment                 01-Jan-06  In Progress
6.2.3   First jointly supported service                                   01-Jun-05  Complete
6.2.4   Common security policy                                            01-Jan-05  Complete
6.2.5   Integrated plan for Grid support in the UK beyond 2007            01-Apr-06  In Progress
6.2.6   First stage connection of GridPP sites to NGS                     01-Apr-05  In Progress
6.2.7   Second stage connection of GridPP sites to NGS                    01-Apr-06  Not Started
6.2.8   Final stage connection of GridPP sites to NGS                     31-Aug-07  Not Started
6.2.9   Number of NGS representatives on GridPP committees                Ongoing    OK
6.2.10  GridPP attendance at NGS committee meetings                       Ongoing    OK
6.2.11  Number of non-HEP applications tested on the GridPP Grid          Ongoing    OK
6.2.12  Number of GridPP members attending GGF meetings                   Ongoing    OK
6.2.13  Number of GridPP members in charge of formal GGF Working Groups   Ongoing    OK
6.2.14  Implemented Common Security Policy                                01-Nov-05  Complete

- The slow emergence of real web-services solutions means that 6.2.2 will probably not be completed during GridPP2.
- GridPP is committed to gLite, and NGS intends to be compatible with this but cannot deploy the full gLite stack.
- The GridPP collaboration is discussing formal affiliation with NGS; presently Edinburgh is an NGS affiliate, and Oxford, RAL, Manchester and Lancaster are partners. Discussions are underway with Glasgow, UCL and IC.

In the Beginning…
The UK Grid for HEP really started to grow in 2000, with the release of the Hoffmann report into LHC computing requirements and the results of the UK Government Spending Review (SR2000), which targeted £80m for e-Science:
- Academic Application Support Programme: Research Councils (£74m) and DTI (£5m), the Research Council share comprising PPARC (£26m), EPSRC (£17m), BBSRC (£8m), MRC (£8m), NERC (£7m), CLRC (£5m) and ESRC (£3m).
- Generic Challenges: EPSRC (£15m), DTI (£15m).
- Collaborative projects: Industrial Collaboration (£40m).

Hardware Costs
[Charts: storage price per TB following Kryder's law, and estimated price per KSI2K following Moore's law, extrapolated from 2002 to 2012.]
Hardware costs are extrapolated from recent purchases. However, experience tells us there are fluctuations associated with technology steps, so there is significant uncertainty in the integrated cost. The model must factor in:
- the operational life of equipment;
- known operational overheads;
- the lead time for delivery and deployment.
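A sketch of the kind of extrapolation the curves above represent: unit prices falling exponentially (Moore's law for CPU, Kryder's law for disk). The 2-year halving time is an illustrative assumption, not the project's fitted value; the Tier-2 cost table earlier implies year-on-year declines of roughly 20-30%.

```python
# Exponential price-decline extrapolation (illustrative halving time).
def unit_price(price_now_k: float, years_ahead: float,
               halving_time_y: float = 2.0) -> float:
    """Projected unit price after an exponential decline."""
    return price_now_k * 0.5 ** (years_ahead / halving_time_y)

# Starting from the 2007 CPU price of £0.392k/KSI2K:
for dy in range(6):
    print(2007 + dy, f"£{unit_price(0.392, dy):.3f}k/KSI2K")
```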
Hardware Costs: Tape
[Slide: spreadsheet of the tape costing model for 2007-2012. A capacity model tracks the required tape capacity against the actual CASTOR capacity, the phase-out of existing 9940 media and slots, new robot infrastructure and slots, T10K/T20K media purchases, and media cost and capacity. A bandwidth model tracks the total required bandwidth in and out of beam (estimated fill rate over 6 months, in-beam double fill rate, media conversion, in-beam and out-of-beam reprocessing, read rates, and drive dead time on writes and reads) against the aggregate bandwidth of the tape drives ("server bricks"), driving purchases of drives and maintenance. Annual spend lines cover media, server bricks, drive and ADS maintenance, minor parts, and robot M&O.]
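The structure of that model, reduced to a sketch: capacity shortfalls drive media purchases, and bandwidth shortfalls drive drive ("server brick") purchases. All inputs below are illustrative placeholders apart from the per-brick bandwidth, which the slide quotes at 80-120 MB/s.

```python
import math

# Capacity model: whole tapes needed to cover a capacity shortfall.
def media_to_buy(required_tb: float, held_tb: float, tape_tb: float) -> int:
    return math.ceil(max(0.0, required_tb - held_tb) / tape_tb)

# Bandwidth model: drives needed to cover a rate shortfall.
def drives_to_buy(required_mb_s: float, held_mb_s: float,
                  drive_mb_s: float = 80.0) -> int:
    return math.ceil(max(0.0, required_mb_s - held_mb_s) / drive_mb_s)

print(media_to_buy(2538, 816, 0.5))  # hypothetical 0.5 TB media -> 3444 tapes
print(drives_to_buy(544, 0))         # -> 7 drives at 80 MB/s each
```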
Running Costs (work in progress)

                        CPU                           Disk
                 2007   2008   2009   2010     2007   2008   2009   2010
New systems       166    761    404    473      101    201     82    134
New racks           5     24     13     15       14     29     12     19
Racks phased out    4      3      5      0        3      4      0     10
Rack count         18     39     47     61       32     57     69     78
kW per new system  0.26   0.26   0.27   0.29     0.735  0.77   0.81   0.85
New kW              -    198    110    136       -     155     66    114
Phased-out kW       -     18     51      0       -      14      0     49
Total load [kW]   151    330    390    525      116    257    323    388
Cost per kW [£k]    -     0.00008 0.00008 0.00009  -   0.00008 0.00008 0.00009
Cost              £0k   £347k  £430k  £609k     £0k   £270k  £357k  £450k

Tier-1 Growth

                         Now       Start of GridPP3   End of GridPP3
Spinning disks           ~2000     ~10,000            ~20,000
Yearly disk failures     30-45     200-300?           400-600?
CPU systems              ~550      ~1800              ~2700
Yearly system failures   35-40     120-130?           180-200?

To achieve the levels of service specified in the MoU, a multi-skilled incident response unit (3 FTE) is proposed. This is intended to reduce the risk of over-provisioning other work areas to cope with long-term fluctuations in the fault rate. These staff will have an expectation that their primary daily role will be dealing with what has gone wrong. They will also provide the backbone of the primary call-out team.

Tier-2 Allocations
• Take each experiment's CPU and disk requirements (from Dave Newbold).
• For each experiment, share out among the Tier-2s.
• For each Tier-2, share out among the institutes.
• Sum over experiments (this maintains the correct CPU/disk ratio).
Sharing is guided by:
• the size of the local community (number of Ac/Ph/PP);
• past delivery (KSI2K to date, disk usage last quarter);
• the current resources available.

Tier-2 'Shares'

Tier-2      Physicists (LHC only)   Existing resources 1Q06   Delivery to date    Disk used 1Q06   Min   Max   Ave   Share
                                    KSI2K           TB        KSI2K-hours         TB
London       40  (26%)              1049.0  (27%)    37.7     1,348,236  (39%)    17.9  (21%)     21%   39%   28%   ~35%
NorthGrid    33  (22%)              1783.1  (48%)   132.2     1,229,271  (36%)    34.2  (40%)     22%   48%   36%   ~35%
ScotGrid     14   (9%)               354.0  (10%)    44.6       187,443   (5%)    21.0  (24%)      5%   24%   12%   ~10%
SouthGrid    66  (43%)               516.4  (15%)    48.4       661,080  (19%)    13.4  (15%)     15%   43%   23%   ~20%
Total       152                     3702.5          262.9     3,426,030           86.6

Example
The CMS requirement in 2008 is 1800 KSI2K and 400 TB. Applying the Tier-2 sharing matrix (PMB/Tier-2 Board) and the institute sharing matrix (Tier-2 Board) shown earlier, e.g. for CMS: London 0.75, and within London, Imperial 0.90.
i.e. the Imperial 'allocation' is 1800 KSI2K (400 TB) x 0.75 x 0.9 = 1215 KSI2K (270 TB). The sketch at the end of this section repeats this arithmetic.

Crosscheck: Allocated CPU v 'Size'
[Scatter plot of allocated CPU share against the size/delivery/current-resource measure per institute; Imperial, Glasgow and Liverpool lie at the high end, Bristol at the low end.]

Tier-2 Staff

Institute    FTE     FTE %
Brunel       0.50    3%
Imperial     1.50    10%
QMUL         1.00    7%
RHUL         0.50    3%
UCL          1.00    7%
Lancaster    1.50    10%
Liverpool    1.00    7%
Manchester   1.50    10%
Sheffield    1.00    7%
Durham       0.25    2%
Edinburgh    0.50    3%
Glasgow      1.00    7%
Birmingham   1.00    7%
Bristol      0.50    3%
Cambridge    1.00    7%
Oxford       0.50    3%
RAL PPD      0.50    3%
Total        14.75   100%

Allocated FTE v CPU
[Scatter plot of allocated FTE share against allocated CPU share by institute; Sheffield, Manchester and Imperial are labelled, with Birmingham and UCL highlighted.]
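The two-level sharing arithmetic from the Example above, as a minimal sketch:

```python
# Institute allocation = requirement x Tier-2 share x institute share.
def institute_allocation(requirement: float, tier2_share: float,
                         institute_share: float) -> float:
    return requirement * tier2_share * institute_share

CMS_CPU_KSI2K, CMS_DISK_TB = 1800, 400  # CMS requirement in 2008
LONDON_CMS, IMPERIAL_CMS = 0.75, 0.90   # from the sharing matrices above

print(institute_allocation(CMS_CPU_KSI2K, LONDON_CMS, IMPERIAL_CMS))  # 1215.0
print(institute_allocation(CMS_DISK_TB, LONDON_CMS, IMPERIAL_CMS))    # 270.0
```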
Proposal Procedure
[Flowchart. GridPP1/GridPP2: the proposal (Tier-1 £Am, Tier-2 £Bm, Middleware £Cm, Applications £Dm, Management £Em, ..., total £Xm) went through peer review, was re-evaluated, allocated across institutes (Institute 1 £Fm ... Institute 5 £Jm, total £Xm), and the institutes then applied for grants. GridPP3: the proposal (£am, £bm, £cm, £dm, £em, ..., total £Ym) goes through peer review, after which the institutes apply for grants (Institute 1 £fm ... Institute 5 £jm, total £Ym), at which point the question "Is this still a sensible project?" is asked.]

GridPP3 Deployment Board
In GridPP2, the Deployment Board is squeezed into a space already occupied by the Tier-2 Board, the D-TEAM and the PMB. Many meetings have been "joint" with one of these other bodies, and its identity and function have become blurred.
[Diagram: the Project Management Board above the Deployment Board (T1B Chair, T2B Chair, Production Manager, Technical Coordinator), which oversees the Tier-1 Board, the Tier-2 Board and the D-Team (groups 1..n).]
In GridPP3, we propose a combined Tier-2 Board and Deployment Board with overall responsibility for deployment strategy to meet the needs of the experiments. In particular, this is a forum where providers and users formally meet. It deals with:
1) Issues raised by the Production Manager which require strategic input.
2) Issues raised by users concerning the service provision.
3) Issues to do with Tier-1 and Tier-2 relationships.
4) Issues to do with Tier-2 allocations, service levels and performance.
5) Issues to do with collaboration with Grid Ireland and the NGS.

GridPP3 DB Membership
1) Chair
2) Production Manager
3) Technical Coordinator
4) The four Tier-2 Management Board chairs
5) Tier-1 Board Chair
6) ATLAS, CMS and LHCb representatives
7) User Board Chair
8) Grid Ireland representative
9) NGS representative
10) Technical people invited for specific issues
The list above gives ~13 core members, 5 of whom are probably on the PMB. There is a move away from the technical side of the current DB: it becomes a forum where the deployers meet each other and hear directly from the main users. The latter is designed to ensure buy-in by the users to strategic decisions.

Grid Data Management
Components: file transfer services; metadata catalogues; services to manage the replication of data.
Operational support: FTS; metadata catalogues as they are deployed; replica optimisation services eventually.
Maintenance: metadata services and, eventually, replica optimisation services.
Development: common metadata services; replica optimisation.

Storage Management
Components:
- DPM (used at 12 Tier-2 sites in the UK);
- dCache (used at the Tier-1 and 7 Tier-2 sites in the UK);
- CASTOR SRM1 (Tier-1, but to be phased out in 2006);
- CASTOR SRM2 (Tier-1; GridPP is the primary developer).
Operational support: all of the above components; we hope to reduce the number.
Maintenance: GridPP "owns" the dCache installation and configuration scripts within LCG, and the SRM2 interface to CASTOR.
Development: none envisaged in the GridPP3 era; however, SRM version 3 may impose some requirements.
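To illustrate the decision a replica optimisation service (mentioned under Grid Data Management above) has to make, here is a purely hypothetical sketch; the sites and costs are made up, and this is not GridPP or LCG code: given several replicas of a file, pick the cheapest source.

```python
# Choose the replica with the lowest notional transfer cost to the job site.
replicas = {"RAL": 1.0, "IN2P3": 3.5, "FNAL": 7.0}  # site: cost (made up)

def best_replica(costs: dict[str, float]) -> str:
    return min(costs, key=costs.get)

print(best_replica(replicas))  # -> "RAL"
```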
Information and Monitoring
Components:
- R-GMA (information system slated to replace the BDII);
- Service Discovery (SD);
- APEL accounting (uses R-GMA);
- GLUE Schema (information model used to define Grid resources).
Operational support: R-GMA.
Maintenance: R-GMA and SD.
Development: R-GMA may still require development at the start of GridPP3; the GLUE schema is likely to require ongoing development (minor effort).

Workload, Performance and Portal
Components:
- WMS (Resource Broker, Logging & Bookkeeping server, etc.);
- tools to gather job information (used by ATLAS, CMS and the RTM);
- Real Time Monitor (RTM);
- GridPP Portal.
Operational support: WMS; job information repository; job information analysis.
Maintenance: WMS testing; job information scripts; RTM; Portal.
Development: Portal (to address the needs of new users); job information scripts (to enrich/optimise content); possibly the RTM, if evolution is still required/desired.

Security
Components:
- GridSite toolkit (includes the Grid Access Control Language, GACL, and GridSite's Apache extension, mod_gridsite, both used by ATLAS and CMS);
- VOMS.
Operational support: GridSite and VOMS; the Operational Security Officer post; the International Security Coordination post.
Maintenance: GridSite.
Development: GridSite.

Networking
Components:
- high-level contacts with JISC and UKERNA;
- requirements and provisioning;
- work with providers on interfaces between the Grid and network operations;
- network monitoring and diagnostic tools.
Operational support: network monitoring and diagnostics.
Maintenance: minor.
Development: none.

Active Users (All VOs)
[Chart: active users across all VOs over time.]

Active Users by LHC experiment
[Charts: active users per experiment: ALICE (8), ATLAS (70), CMS (150), LHCb (40).]

Job success? Overview
[Chart: overall job success rates.]

Job Success by LHC experiment
[Charts: job success rates for ALICE, ATLAS, CMS and LHCb.]