Massive Computing at CERN and lessons learnt
Bob Jones, CERN – Bob.Jones <at> CERN.ch
(Several slides courtesy of Ian Bird, CERN.)

WLCG – what and why?
• A distributed computing infrastructure to provide the production and analysis environments for the LHC experiments
• Managed and operated by a worldwide collaboration between the experiments and the participating computer centres
• The resources are distributed – for funding and sociological reasons
• Our task is to make use of the resources available to us – no matter where they are located

What is WLCG today?
• Collaboration: coordination, management & reporting; coordination of resources & funding; coordination with service & technology providers; common requirements; Memorandum of Understanding
• Framework: service management; support processes & tools; service coordination; common tools; world-wide trust federation for CAs and VOs; operational security; monitoring & accounting; complete policy framework
• Distributed computing services
• Physical resources: CPU, disk, tape, networks

WLCG data processing model
• Tier-0 (CERN): data recording; initial data reconstruction; data distribution
• Tier-1 (11 centres): permanent storage; re-processing; analysis
• Tier-2 (~130 centres): simulation; end-user analysis

WLCG Collaboration Status (26 June 2009)
• Tier 0; 11 Tier 1s; 64 Tier 2 federations
• Tier-0/Tier-1 centres: CERN, US-BNL, US-FNAL, Ca-TRIUMF, Amsterdam/NIKHEF-SARA, Bologna/CNAF, Taipei/ASGC, NDGF, De-FZK, Barcelona/PIC, Lyon/CCIN2P3, UK-RAL
• Today we have 49 MoU signatories, representing 34 countries: Australia, Austria, Belgium, Brazil, Canada, China, Czech Rep, Denmark, Estonia, Finland, France, Germany, Hungary, Italy, India, Israel, Japan, Rep. Korea, Netherlands, Norway, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden, Switzerland, Taipei, Turkey, UK, Ukraine, USA

Fibre cut during 2009
• Redundancy meant no interruption

Worldwide resources
• >140 sites
• ~250k CPU cores
• ~100 PB disk

Service quality: defined in MoU
• The MoU defines key performance and support metrics for Tier-1 and Tier-2 sites
  – reliabilities are an approximation for some of these
  – there are also metrics on response times, resources, etc.
• The MoU has been an important tool in bringing services to an acceptable level

From testing to data
• Independent experiment data challenges, e.g. DC04 (ALICE, CMS, LHCb) and DC2 (ATLAS) in 2004, saw the first full chain of the computing models run on grids
• Service Challenges, proposed in 2004, to demonstrate service aspects:
  – data transfers for weeks on end
  – data management
  – scaling of job workloads
  – security incidents ("fire drills")
  – interoperability
  – support processes
• Challenge timeline, 2004–2010:
  – SC1: basic transfer rates
  – SC2: basic transfer rates
  – SC3: sustained rates, data management, service reliability
  – SC4: nominal LHC rates, disk-tape tests, all Tier-1s, some Tier-2s
  – CCRC'08: readiness challenge, all experiments, ~full computing models
  – STEP'09: scale challenge, all experiments, full computing models, tape recall + analysis
• Focus on real and continuous production use of the service over several years (simulations since 2003, cosmic-ray data, etc.)
• Data and service challenges to exercise all aspects of the service – not just data transfers, but workloads, support structures, etc. (a back-of-the-envelope rate sketch follows below)
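The "transfer rates" that the Service Challenges ramped up are ultimately a simple throughput budget: a yearly data volume has to be exported from the Tier-0 to the Tier-1 centres within the available running time. The sketch below only illustrates that arithmetic; the data volume, running days and per-site shares are made-up assumptions for the example, not official WLCG figures.

```python
# Back-of-the-envelope sketch of what "sustained rates" mean when exporting
# Tier-0 data to Tier-1 centres. All numbers are illustrative assumptions,
# not official WLCG figures.

PB = 1e15  # bytes in a (decimal) petabyte

raw_data_per_year_pb = 15      # assumed annual data volume to export, in PB
live_days_per_year = 200       # assumed days per year the export must run
tier1_shares = {               # hypothetical fractional shares per Tier-1
    "T1-A": 0.25, "T1-B": 0.25, "T1-C": 0.20, "T1-D": 0.15, "T1-E": 0.15,
}

total_bits = raw_data_per_year_pb * PB * 8
seconds = live_days_per_year * 24 * 3600
aggregate_gbps = total_bits / seconds / 1e9

print(f"Aggregate sustained export rate: {aggregate_gbps:.1f} Gb/s")
for site, share in tier1_shares.items():
    print(f"  {site}: {aggregate_gbps * share:.2f} Gb/s sustained")
```

The point of running such transfers "for weeks on end" in the challenges is that sustaining even a modest average rate continuously is much harder than reaching it in a short burst.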
Large scale = long times
• The LHC, the experiments and the computing have taken ~20 years to build and commission
• They will run for at least 20 years
• We must be able to rely on long-term infrastructures:
  – global networking
  – strong and stable NGIs (or their evolution), which should eventually be self-sustaining
  – long-term sustainability must come out of the current short-term project funding cycles

Grids & HEP: common history
• CERN and the HEP community have been involved with grids from the beginning
• Grids were recognised as a key technology for implementing the LHC computing model
• HEP work with the EC-funded EDG/EGEE projects in Europe and with iVDGL/Grid3/OSG etc. in the US has been of clear mutual benefit:
  – infrastructure development driven by HEP needs
  – the robustness needed by WLCG is benefitting other communities
  – transfer of technology from HEP: Ganga, AMGA, etc. are now used by many communities

European Grid Infrastructure
• European Data Grid (EDG): exploring the concepts in a testbed
• Enabling Grids for E-sciencE (EGEE): moving from prototype to production
• European Grid Infrastructure (EGI): routine usage of a sustainable e-infrastructure

European Grid Infrastructure: the first year (status April 2011, yearly increase in parentheses)
• 13,319 end-users (+9%); 186 VOs (+6%); ~30 active VOs (constant)
• Logical CPUs (cores): 207,200 in EGI (+8%), 308,500 in all; 90 MPI sites
• 101 PB disk; 80 PB tape
• 25.7 million jobs/month, i.e. 933,000 jobs/day (+91%)
• 320 sites (+1.4%); 58 countries (+11.5%)
• Non-HEP users: ~3.3 million jobs/month
• Communities include archeology, astronomy, astrophysics, civil protection, computational chemistry, earth sciences, finance, fusion, geophysics, high energy physics, life sciences, multimedia, material sciences, …

Grids, clouds, supercomputers, etc.
• Grids: collaborative environment; distributed resources (for political/sociological reasons); commodity hardware; (HEP) data management; complex interfaces (a bug, not a feature); communities are expected to contribute resources
• Supercomputers: scarce; low-latency interconnects; applications are peer-reviewed; parallel/coupled applications; there are also supercomputer grids (DEISA/PRACE, TeraGrid/XD)
• Clouds: proprietary (in implementation); economies of scale in management; commodity hardware; pay-as-you-go usage model; details of the physical resources are hidden; simple interfaces
• Volunteer computing: a simple mechanism to access millions of CPUs; difficult if (much) data is involved; control of the environment: check; community building – people get involved in science; potential for huge amounts of real work (a minimal sketch of the work-unit model follows below)
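Volunteer computing works well precisely because the work can be cut into small, independent units that are dispatched to untrusted machines and cross-checked by redundancy. The sketch below is a minimal illustration of that idea; the function names and the error model are invented for the example and do not reflect the actual LHC@home or BOINC implementation.

```python
# Minimal illustration of the volunteer-computing work-unit model: independent
# tasks are sent to several untrusted hosts and a result is accepted only when
# a majority of the returned answers agree. Invented example, not real
# LHC@home/BOINC code.
import random
from collections import Counter

def volunteer_run(work_unit, error_rate=0.05):
    """Pretend a volunteer host processes one work unit; it sometimes misbehaves."""
    result = sum(work_unit)                      # stand-in for a real simulation
    return result if random.random() > error_rate else result + 1

def validate_by_redundancy(work_unit, replicas=3):
    """Dispatch the same unit to several volunteers and keep the majority result."""
    results = [volunteer_run(work_unit) for _ in range(replicas)]
    value, votes = Counter(results).most_common(1)[0]
    return value if votes >= 2 else None         # no quorum: re-dispatch later

work_units = [[i, i + 1, i + 2] for i in range(5)]    # tiny independent tasks
print([validate_by_redundancy(wu) for wu in work_units])
```

The same structure also shows why "much data involved" is the hard case: returning a few numbers per unit is cheap, while shipping gigabytes per unit to a home PC is not.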
Collaboration with the general public: the Citizen Cyberscience Centre
• Philosophy: promote web-based citizen participation in science projects as an appropriate low-cost technology for scientists in the developing world
• Partners: CERN, the UN Institute for Training and Research, the University of Geneva
• Sponsors: IBM, HP Labs, the Shuttleworth Foundation
• Technology: open-source platforms for internet-based distributed collaboration
• Projects:
  – Computing for Clean Water: optimizing nanotube-based water filters by large-scale simulation on volunteer PCs
  – AfricaMap: volunteer thinking to generate maps of regions of Africa from satellite images, with UNOSAT
  – LHC@home: a new volunteer project for public participation in LHC collision simulations, using VM technology
• Plans: training workshops in 2011 in India, China, Brazil and South Africa
(Slide credit: Frédéric Hemmer, CERN IT Department.)

Some more questions to be answered
• Computing model: how many computing models exist in the community, and can they all use the same computing infrastructure?
• Continuous load or periodic campaigns? How intensely and frequently will the community use the computing infrastructure? (see the sketch at the end)
• Manpower: do you have enough geeks to port the code and support it?
• How committed is the community? Are you prepared to contribute and share computing resources?

Bob Jones – May 2011
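The "continuous load or periodic campaigns?" question becomes concrete once a community can estimate its demand profile: a high peak-to-average ratio points to campaign-style usage, which shared or opportunistic capacity serves better than dedicated hardware. The sketch below illustrates that arithmetic with an entirely hypothetical demand profile.

```python
# Rough sketch of characterising a community's demand as continuous load vs.
# periodic campaigns. The monthly core-hour figures are hypothetical.

monthly_core_hours = [
    2_000, 2_500, 2_200, 50_000, 48_000, 3_000,
    2_800, 2_600, 45_000, 47_000, 2_400, 2_300,
]

HOURS_PER_MONTH = 730  # approximate average hours in a month

avg_cores = sum(monthly_core_hours) / (len(monthly_core_hours) * HOURS_PER_MONTH)
peak_cores = max(monthly_core_hours) / HOURS_PER_MONTH

print(f"Average demand:     {avg_cores:.1f} cores")
print(f"Peak-month demand:  {peak_cores:.1f} cores")
print(f"Peak/average ratio: {peak_cores / avg_cores:.1f} "
      "(a high ratio suggests campaign-style use of shared capacity)")
```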