OSG Overview for the Science Advisory Group
Ruth Pordes, Fermilab, June 12th 2007

Goals of The OSG
• Maintain the Distributed Facility through a core of usable, extensible, dependable, secure distributed infrastructure delivering to the science needs of the stakeholders.
• Provide mechanisms and help for user groups to adapt their codes and use the OSG.
• Provide for opportunistic use of shared resources as well as resource use through prior agreement.
• Provide an integrated, secure, reference software stack for OSG and other Grids.
• Grow to be a truly national resource that anyone can join and that is available to any researcher.
Scientific Advisory Group 11/6/2015

Benefits to Science and Research
• Enable scientists to use a greater percentage of the available compute cycles.
• Help scientists use distributed systems and software with less effort.
• Enable more sharing and reuse of software, and reduce duplication of effort, by providing effort for integration and extensions.
• Establish an “open-source” community that works together to communicate knowledge and experience and to reduce overheads for new participants.

Cost-Value Model
• Increased usage of CPUs and infrastructure alone (i.e. the cost of processing cycles) is not the persuading cost-benefit value.
• The benefits come from reducing risk in, and sharing support for, large, complex systems which must be run for many years with a short-lifetime workforce:
  Savings in effort for integration, system and software support.
  Opportunity and flexibility to distribute load and address peak needs.
  Maintenance of an experienced workforce in a common system.
  Lowering the cost of entry for new contributors.
  Enabling new computational opportunities for communities that would not otherwise have access to such resources.
OSG in a nutshell

History
[Timeline, 1999-2009: PPDG (DOE), GriPhyN (NSF), iVDGL (NSF), Trillium, Grid3 and OSG (DOE+NSF), alongside LIGO preparation and operation; LHC construction, preparation and operations; the European Grid + Worldwide LHC Computing Grid; and campus and regional grids.]
Grid projects established working collaborations between Condor, Globus and the physics experiments. OSG leadership led a “grass-roots” collaboration of these projects. The US LHC program committed to a joint project with broader contributions and goals. LIGO committed to a data grid model. DOE and NSF accepted a joint SciDAC and unsolicited NSF proposal.

The Consortium and the Project
• The Consortium comprises all institutions and projects that contribute to OSG.
• The Project is funded to provide staff for specific aspects of managing and sustaining the OSG.
• The deliverables and milestones of the Project serve the scientific needs of the Consortium members.
• All OSG activities involve both Project staff and contributors from the Consortium.

Structure of the Consortium
[Organization chart]

Scope of the OSG Project
• Included:
  The distributed facility operation and maintenance.
  Training and education of new participants and contributors.
  Management and administration of the project and the consortium.
  Extensions to and integration of new services, software, capabilities and user communities.
• Not Included:
  Resources - farms and storage - are contributed. Currently access to ~34K cores and 2 PB of disk storage.
  Software - facility and application - is developed by external projects with their own priorities and schedules.

Structure of the Project
FTEs planned:
  Facility management and operations      7.0
  Security & troubleshooting              4.5
  Software release & support              6.5
  Engagement                              2.0
  Training & Education                    2.0
  Extensions                              8.0
  Executive Director and administration   3.0
  Total                                  33

What do I do as Executive Director?
• Work with the area coordinators & institutional PIs:
  To define and execute the program of work.
  To ensure expectations and outcomes come together, and are communicated and understood.
  To match needs, priorities, and effort. Many OSG staff are fractions of an FTE.
  To collaborate with the external software development projects on which we depend.
• Work with the Council & Consortium:
  As the interface to the project in many areas.
  On large-scale requests for use of the resources.
  On agreements with partners for bilateral commitments.
  On extending our membership and participation.
  Organize reviews, Joint Oversight Team presentations and Consortium meetings.
• Communicate a lot, e.g.:
  Represent OSG on the WLCG Management Board.
  Interface to the funding agencies.
  Present OSG in various meetings.

Some of the Challenges?
• Making the consortium and project work with people from different organizational cultures and with different goals in terms of “success”.
• Bringing a focus on operations and stability rather than development and “innovation”.
• Balancing the directed needs of stakeholders with the broader scope of commitments.

Institutions Involved
Project staff FTEs:
  Boston 0.5, BNL 3.0, CalTech 2.0, Columbia 0.5, Cornell 0.5, FermiLab 7.0, ISI (year 1) 0.5, Indiana U. 3.0, LBNL 1.5, RENCI 1.5, SLAC 0.5, UCSD 2.0, U. of Chicago 3.0, U. of Florida 0.5, U. of Iowa 1.0, Wisconsin 6.0. Total: 33.
Sites on OSG, many with >1 resource; 46 separate institutions (* = no physics):
  U. of Michigan, Florida State U., Nebraska, U. of Arkansas *, Kansas State, LBNL, U. of Chicago, U. of Iowa, Notre Dame, U. California at Riverside, Academia Sinica, Hampton U., Penn State U., UCSD, Brookhaven National Lab, UERJ Brazil, Oklahoma U., U. of Florida, Boston U., Iowa State, SLAC, U. Illinois Chicago, Cinvestav (Mexico City), Indiana University, Purdue U., U. New Mexico, Caltech, Lehigh University *, Rice U., U. Texas at Arlington, Louisiana University, Southern Methodist U., U. Virginia, Dartmouth U. *, Louisiana Tech *, U. of Sao Paulo, U. Wisconsin Madison, Florida International U., McGill U., Wayne State U., U. Wisconsin Milwaukee, Clemson U. *, Fermilab, MIT, TTU, Vanderbilt U.

Users and Communities/VOs
Campus Grids: 5
  Fermi National Accelerator Center (Fermilab)
  Georgetown University Grid (GUGrid)
  Grid Laboratory of Wisconsin (GLOW)
  Grid Research and Education Group at Iowa (GROW)
  University of New York at Buffalo (GRASE)
Research VOs: 15 (5 are non-physics)
  Collider Detector at Fermilab (CDF)
  Compact Muon Solenoid (CMS)
  CompBioGrid (CompBioGrid)
  D0 Experiment at Fermilab (DZero)
  Dark Energy Survey (DES)
  Functional Magnetic Resonance Imaging (fMRI)
  Geant4 Software Toolkit (geant4)
  Genome Analysis and Database Update (GADU)
  International Linear Collider (ILC)
  Laser Interferometer Gravitational-Wave Observatory (LIGO)
  nanoHUB Network for Computational Nanotechnology (NCN) (nanoHUB)
  Sloan Digital Sky Survey (SDSS)
  Solenoidal Tracker at RHIC (STAR)
  Structural Biology Grid (SBGrid)
  United States ATLAS Collaboration (USATLAS)
Regional Grids: 4
  NYSGRID
  Distributed Organization for Scientific and Academic Research (DOSAR)
  Great Plains Network (GPN)
  Northwest Indiana Computational Grid (NWICG)
OSG Operated VOs: 4
  Engagement (Engage)
  Open Science Grid (OSG)
  OSG Education Activity (OSGEDU)
  OSG Monitoring & Operations

CPUHours/Day on OSG During 2007
[Chart: CPU-hours per day on OSG, January-May 2007, scale up to 160,000 CPU-hours/day, broken down by site: AGLT2, ASGC_OSG, BNL_OSG, BNL_PANDA, CIT_CMS_T2, FIU-PG, FNAL_CDFOSG_1, FNAL_CDFOSG_2, FNAL_FERMIGRID, FNAL_GPFARM, GLOW, GRASE-CCR-U2, GRASE-GENESEO-OSG, GROW-PROD, HEPGRID_UERJ, IPAS_OSG, Lehigh Coral, MIT_CMS, Nebraska, NERSC-PDSF, OSG_LIGO_PSU, OU_OCHEP_SWT2, OU_OSCER_ATLAS, OU_OSCER_CONDOR, Purdue-Lear, Purdue-RCAC, SPRACE, STAR-BNL, STAR-WSU, TTU-ANTAEUS, UC_ATLAS_MWT2, UCRHEP, UCSDT2, UFlorida-IHEPA, UFlorida-PG, USCMS-FNAL-WC1-CE, USCMS-FNAL-WC1-CE2, UTA-DPCC, UTA_SWT2, UWMilwaukee, Vanderbilt.]
Currently undercounting, probably by ~25%, as not all sites are reporting. 1 CPU-year ~ 9,000 CPU-hours.

National Activities
• We interoperate and collaborate with TeraGrid:
  Several communities run applications across both.
  Several sites are on both.
  Common Condor and Globus versions and testing infrastructure.
  Shared training exercises.
• We promote the development of local infrastructures and expertise:
  Campus Infrastructure Days with Internet2, TeraGrid and Educause help campuses (CIOs, researchers, teaching departments) identify cross-campus needs and organize themselves to participate.

International Activities
• We deliver the US contribution to the Worldwide LHC Computing Grid (WLCG) collaboration in support of the LHC experiments. Interoperability and compatibility with the other WLCG infrastructures are important.
• Several communities run jobs and transfer data across both the Enabling Grids for E-sciencE (EGEE) and OSG.
• Several sites and partners are international.
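The accounting figures quoted above (roughly 25% of usage unreported, and the slide's round number of 1 CPU-year ~ 9,000 CPU-hours) combine into a simple conversion from reported usage to delivered CPU-years. A minimal sketch; the 160,000 CPU-hours/day input is illustrative, taken from the chart's axis scale rather than a measured peak:

```python
# Back-of-the-envelope conversion of reported OSG usage into CPU-years,
# using figures from the accounting slides: ~25% of usage unreported,
# and 1 CPU-year ~ 9,000 CPU-hours (the slide's round number; strictly,
# 24 h x 365 d is 8,760 hours).

HOURS_PER_CPU_YEAR = 9_000

def cpu_years(reported_cpu_hours: float, undercount: float = 0.25) -> float:
    """Scale reported CPU-hours up for non-reporting sites, then convert."""
    corrected = reported_cpu_hours / (1.0 - undercount)
    return corrected / HOURS_PER_CPU_YEAR

# An illustrative day at the chart's full scale of 160,000 reported CPU-hours:
print(round(cpu_years(160_000), 1))  # prints 23.7 (CPU-years in one such day)
```

The point of the sketch is the order of operations: correct for the undercount first, then convert, since dividing the uncorrected figure would understate delivered capacity by the same ~25%.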
CPUHours/Day by VO
[Chart: CPU-hours per day on OSG by VO, January-May 2007, scale up to 160,000 CPU-hours/day: ATLAS, cdf, cdms, cms, des, dosar, dzero, engage, gadu, glow, gpn, grow, ilc, ktev, LIGO, miniboone, mipp, nanohub, osg, sdss, star, zeus.]
Engage is running Rosetta@home, from the Kuhlman Lab. OSG is running protein molecular dynamics (CHARMM) for Johns Hopkins.

We measure how we are doing
• Summaries of support requests and resolutions.
• Accounting information for CPU, storage and data transfer by site and VO:
  Includes shared and opportunistic resource use.
  Includes information from the user accounting systems.
  Includes some error-reporting information.
• Availability testing, monitoring & display. Feedback from the agencies is that we need more of this.

How do we know if we are doing well?
• Feedback from users and sites is important and ongoing - mail lists, weekly operations meetings, Council meetings.
• Gathering information for research briefs and monthly news articles gives us a feel for whether the use of OSG is benefiting scientific and research output.
• Project deliverables and milestones give a measure of how well the project is executing its plans.

Project Planning
• Overall 5-year goals and milestones come from the proposal.
• A yearly plan of work is made with the Area Coordinators, which results in:
  deliverables, activities & schedule (captured in a WBS structure);
  high-level milestones - some agency-reportable;
  effort assignments.
• We have signed Statements of Work with each institutional PI with project funds. There is a working change control process.
• We revise our plans via weekly Executive Team meetings and Executive Board meetings every six weeks.

Project Tracking
• Milestones are tracked by the Project Associate and discussed in weekly Executive Team meetings.
• Area coordinators and institutional PIs submit quarterly reports.
• Accounted expenditures are tracked quarterly.
• Staff submit monthly reports.
• Weekly area and activity meetings are used for day-to-day tracking and discussion of progress.

Resource Needs and Resource Availability
• Many resources are owned by, or statically allocated to, one user community. The institutions which own resources typically have ongoing relationships with (a few) particular user communities (VOs).
• The remainder of an organization's available resources can be “used by everyone or anyone else”:
  Organizations can decide against supporting particular VOs.
  OSG staff are responsible for monitoring and, if needed, managing this usage.
• Our challenge is to maximize good - successful - output from the whole system.

An Example: D0 Reprocessing
• D0's own resources are committed to the processing of newly acquired data and analysis of the processed datasets.
• In Nov '06, D0 asked to use 1,500-2,000 CPUs for 2-4 months for re-processing of an existing dataset (~500 million events), for science results for the summer conferences in July '07.
• The Executive Board estimated there were currently sufficient opportunistically available resources on OSG to meet the request; we also looked into the local storage and I/O needs.
• The Council members agreed to contribute resources to meet this request.

How did D0 Reprocessing Go?
• D0 had 2-3 months of smooth production running using >1,000 CPUs and met their goal by the end of May.
• To achieve this, D0's testing of the integrated software system took until February.
• OSG staff and D0 then worked closely together as a team to reach the needed throughput goals, facing and solving problems in:
  sites - hardware, connectivity, software configurations;
  application software - performance, error recovery;
  scheduling of jobs onto a changing mix of available resources.

The Results
• Reprocessing was completed, albeit late:
  445 million events were reprocessed.
  12 sites contributed significant resources.
  Over 1,000 jobs a day were sustained.
• 286 million events were done on OSG sites.
• The initial ramp-up to scale was slow and labor-intensive for both D0 and OSG.
• Changes in the availability of resources had a negative impact.
• Sustaining the throughput was manpower-intensive on the D0 side.
• Problems encountered:
  Each site had unique problems when initially used.
  Sites were less stable than expected.
  Root-cause diagnosis and analysis of problems was very difficult.
  Scaling up showed problems in throughput and overheads.

D0 Throughput
[Charts: D0 event throughput, and D0 OSG CPU-hours per week for weeks 1-23 of 2007, scale up to 160,000 CPU-hours/week, by site: CIT_CMS_T2, FNAL_DZEROOSG_2, FNAL_FERMIGRID, FNAL_GPFARM, GLOW, GRASE-CCR-U2, MIT_CMS, MWT2_IU, Nebraska, NERSC-PDSF, OSG_LIGO_PSU, OU_OSCER_ATLAS, OU_OSCER_CONDOR, Purdue-RCAC, SPRACE, UCSDT2, UFlorida-IHEPA, UFlorida-PG, USCMS-FNAL-WC1-CE.]

What did this teach us?
• Consortium members contributed significant opportunistic resources, as promised.
• VOs can use a significant number of sites they “don't own” to achieve a large effective throughput.
• Combined teams make large production runs effective. How does this scale?
• Overall availability was sufficient for the request to be met. How are we going to support multiple requests that oversubscribe the resources? We anticipate this may happen soon.
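The D0 reprocessing numbers quoted above (445 million events over roughly three months of production, with more than 1,000 jobs a day sustained) imply a per-day and per-job event rate. A back-of-the-envelope sketch; the 90-day run length is an assumption made here purely for illustration:

```python
# Sanity check on the D0 reprocessing figures: 445 million events
# reprocessed in roughly three months (90 days assumed), with more
# than 1,000 jobs a day sustained.

TOTAL_EVENTS = 445_000_000
DAYS = 90            # assumed ~3 months of smooth production
JOBS_PER_DAY = 1_000 # the sustained rate quoted on the slide

events_per_day = TOTAL_EVENTS / DAYS
events_per_job = events_per_day / JOBS_PER_DAY

print(f"{events_per_day:,.0f} events/day")  # prints 4,944,444 events/day
print(f"{events_per_job:,.0f} events/job")  # prints 4,944 events/job
```

Under these assumptions each job processes on the order of five thousand events, which gives a feel for why changes in the mix of available sites, and per-site instability, translated directly into lost throughput.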
Training
• Grid Schools train students, teachers and new entrants to use grids:
  2-3 day trainings with hands-on workshops and a core curriculum (based on the iVDGL annual week-long schools).
  3 held already; several more this year (2 scheduled). Some as participants in international schools.
  20-60 in each class. Each class is regionally based with a broad catchment area.
  Gathering an online repository of training material.
• End-to-end application training in collaboration with user communities.

Education
• We participate in cyberinfrastructure educational projects:
  I2U2, an extension of the QuarkNet project.
  A site at a South African university.
• Student projects: now that the new Education Coordinator is starting, we will follow up with students and their organizations to help them use OSG for projects and research.

Some of the Challenges I worry about
• How do we ensure, measure and show scientific benefit, both to our existing stakeholders and to new communities?
• What activities do we need towards a sustainable economic model for operation and support?