Transcript: OSG Integration - Dipartimento di Matematica e Applicazioni
The Open Science Grid
Miron Livny OSG Facility Coordinator University of Wisconsin-Madison
Some History and background …
U.S. “Trillium” Grid Partnership
Trillium = PPDG + GriPhyN + iVDGL
- Particle Physics Data Grid (PPDG): $18M (DOE) (1999 - 2006)
- GriPhyN: $12M (NSF) (2000 - 2005)
- iVDGL: $14M (NSF) (2001 - 2006)
Basic composition (~150 people)
- PPDG: 4 universities, 6 labs
- GriPhyN: 12 universities, SDSC, 3 labs
- iVDGL: 18 universities, SDSC, 4 labs, foreign partners
- Experiments: BaBar, D0, STAR, JLab, CMS, ATLAS, LIGO, SDSS/NVO
Complementarity of projects
- GriPhyN: CS research, Virtual Data Toolkit (VDT) development
- PPDG: “end to end” Grid services, monitoring, analysis
- iVDGL: Grid laboratory deployment using VDT
Experiments provide frontier challenges
Unified entity when collaborating internationally
From Grid3 to OSG
OSG 0.2.1
OSG 0.4.0
OSG 0.4.1
OSG 0.6.0
What is OSG? The Open Science Grid is a US national distributed computing facility that supports scientific computing via an open collaboration of science researchers, software developers and computing, storage and network providers. The OSG Consortium is building and operating the OSG, bringing resources and researchers from universities and national laboratories together and cooperating with other national and international infrastructures to give scientists from a broad range of disciplines access to shared resources worldwide.
The OSG Project Co-funded by DOE and NSF at an annual rate of ~$6M for 5 years, starting FY-07. Currently the main stakeholders are from physics: the US LHC experiments, LIGO, the STAR experiment, the Tevatron Run II and astrophysics experiments. A mix of DOE-lab and campus resources. Active “engagement” effort to add new domains and resource providers to the OSG consortium.
OSG Consortium
OSG Project Execution
- OSG PI: Miron Livny
- Resources Managers: Paul Avery, Albert Lazzarini
- Executive Director: Ruth Pordes (role includes provision of middleware)
- Deputy Executive Directors: Rob Gardner, Doug Olson
- Education, Training, Outreach Coordinator: Mike Wilde
- Facility Coordinator: Miron Livny
- Applications Coordinators: Torre Wenaus, Frank Würthwein
- Security Officer: Don Petravick
- Operations Coordinator: Leigh Grundhoefer
- Engagement Coordinator: Alan Blatecky
- Software Coordinator: Alain Roy
OSG Principles
Characteristics:
- Provide guaranteed and opportunistic access to shared resources.
- Operate a heterogeneous environment, both in services available at any site and for any VO, with multiple implementations behind common interfaces.
- Interface to campus and regional grids; federate with other national/international grids.
- Support multiple software releases at any one time.
Drivers:
- Delivery to the schedule, capacity and capability of LHC and LIGO: contributions to/from and collaboration with the US ATLAS, US CMS and LIGO software and computing programs.
- Support for, and collaboration with, other physics and non-physics communities.
- Partnerships with other grids, especially EGEE and TeraGrid.
- Evolution by deployment of externally developed new services and technologies.
Grid of Grids - from Local to Global: National, Campus, Community
Who are you? A resource can be accessed by a user via the campus, community or national grid.
A user can access a resource with a campus, community or national grid identity.
OSG sites running (and monitored) “OSG jobs” in 06/06.
Example GADU run in 04/06
CMS Experiment - an exemplar community grid
[Map: CMS sites spanning OSG (Purdue, Wisconsin, Caltech, Florida, UNL, MIT, UCSD) and EGEE (CERN, Taiwan, Italy, UK, Germany, France).]
Data & jobs moving locally, regionally & globally within CMS grid.
Transparently across grid boundaries from campus to global.
The CMS Grid of Grids Job submission: 16,000 jobs per day submitted across EGEE & OSG via INFN Resource Broker (RB).
Data Transfer: Peak IO of 5Gbps from FNAL to 32 EGEE and 7 OSG sites.
All 7 OSG sites have reached 5TB/day goal.
3 OSG sites (Caltech, Florida, UCSD) exceeded 10TB/day.
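The daily-volume and line-rate figures above can be cross-checked with simple unit conversions; a sketch, assuming decimal units (1 TB = 1e12 bytes) and perfectly sustained rates:

```python
# Convert daily transfer volumes to sustained line rates, and back.
SECONDS_PER_DAY = 86_400

def tb_per_day_to_gbps(tb):
    """Daily volume in TB -> sustained rate in Gbps."""
    return tb * 1e12 * 8 / SECONDS_PER_DAY / 1e9

def gbps_to_tb_per_day(gbps):
    """Sustained rate in Gbps -> daily volume in TB."""
    return gbps * 1e9 * SECONDS_PER_DAY / 8 / 1e12

print(round(tb_per_day_to_gbps(5), 2))    # 5 TB/day ~ 0.46 Gbps sustained
print(round(tb_per_day_to_gbps(10), 2))   # 10 TB/day ~ 0.93 Gbps sustained
print(round(gbps_to_tb_per_day(5)))       # 5 Gbps, if sustained, moves 54 TB/day
```

This shows the two figures are consistent: 5 Gbps was a peak rate, so daily totals of 5-10 TB per site are well below what a sustained 5 Gbps would deliver.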
CMS Xfer on OSG
All sites have exceeded 5 TB per day in June.
CMS Xfer FNAL to World
The US CMS center at FNAL transfers data to 39 sites worldwide in the CMS global Xfer challenge. Peak Xfer rates of ~5 Gbps are reached.
EGEE–OSG inter-operability
- Agree on a common Virtual Organization Management System (VOMS).
- Active joint security groups, leading to common policies and procedures.
- Condor-G interfaces to multiple remote job execution services (GRAM, Condor-C).
- File transfers using GridFTP.
- SRM V1.1 for managed storage access; SRM V2.1 in test.
- Publish OSG BDII to shared BDII for Resource Brokers to route jobs across the two grids.
- Automate ticket routing between GOCs.
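The transfer and proxy mechanisms above map onto familiar command-line tools; an illustrative session, with hostnames and the VO name invented for the example:

```shell
# Obtain a VOMS proxy carrying the VO's extended attributes
voms-proxy-init -voms myvo

# Move a file between an OSG and an EGEE storage element with GridFTP
globus-url-copy gsiftp://se.osg-site.example.edu/data/run01.root \
                gsiftp://se.egee-site.example.org/data/run01.root
```

The same proxy then authenticates Condor-G job submissions to GRAM gatekeepers on either grid.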
OSG Middleware Layering (top to bottom)
- VO services & frameworks: LIGO Data Grid; CMS services & framework; ATLAS services & framework; CDF, D0 SamGrid & framework; …
- OSG Release Cache: VDT + configuration, validation, VO management.
- Virtual Data Toolkit (VDT) common services: NMI + VOMS, CEMon (common EGEE components), MonALISA, Clarens, AuthZ.
- NSF Middleware Initiative (NMI): Condor, Globus, MyProxy.
OSG Middleware Pipeline
- Domain science requirements.
- OSG stakeholders and middleware developer (joint) projects: Condor, Globus, EGEE etc.
- Test on “VO specific grid”.
- Integrate into VDT release.
- Deploy on OSG integration grid; test interoperability with EGEE and TeraGrid.
- Provision in OSG release and deploy to OSG production.
The Virtual Data Toolkit
Alain Roy OSG Software Coordinator Condor Team University of Wisconsin-Madison
What is the VDT?
A collection of software:
- Grid software (Condor, Globus and lots more)
- Virtual Data System (origin of the name “VDT”)
- Utilities
An easy installation:
- Goal: push a button, everything just works
- Two methods: Pacman (installs and configures it all); RPM (installs some of the software, no configuration)
A support infrastructure
How much software?
Who makes the VDT?
The VDT is a product of the Open Science Grid (OSG)
- VDT is used on all OSG grid sites
- OSG is new, but VDT has been around since 2002
- Originally, VDT was a product of GriPhyN/iVDGL
- VDT was used on all Grid2003 sites
Who makes the VDT?
1 mastermind + 3 FTEs: Miron Livny, Alain Roy, Tim Cartwright, Andy Pavlo
Who uses the VDT?
- Open Science Grid
- LIGO Data Grid
- LCG (LHC Computing Grid, from CERN)
- EGEE (Enabling Grids for E-sciencE)
Why should you care?
The VDT gives insight into technical challenges in building a large grid What software do you need?
How do you build it?
How do you test it?
How do you deploy it?
How do you support it?
What software is in the VDT?
(The categories matter more for this talk than the specific software.)
- Job Management: Condor (including Condor-G & Condor-C)
- Data Management: GridFTP (data transfer), RLS (replica location), DRM (storage management), Globus RFT
- Security: VOMS (VO membership), GUMS (local authorization), MyProxy (proxy management), GSI SSH, CA CRL updater
- Information Services: Globus MDS, GLUE schema & providers
- Monitoring: MonALISA, gLite CEMon
- Accounting: OSG Gratia
What software is in the VDT?
- Client tools: Virtual Data System, SRM clients (V1 and V2), UberFTP (GridFTP client)
- Developer tools: PyGlobus, PyGridWare
- Testing: NMI Build & Test, VDT tests
- Support: Apache Tomcat, MySQL (with MyODBC), non-standard Perl modules, Wget, Squid, Logrotate, configuration scripts
- And more!
Building the VDT
We distribute binaries: expecting everyone to build from source is impractical. It is essential to be able to build on many platforms, and to replicate builds. We build all binaries with the NMI Build and Test infrastructure.
[Diagram: sources from CVS are patched and built on the NMI Build & Test Condor pool (70+ computers), packaged and tested, then published as RPM downloads and a Pacman cache for users; contributors supply additional binaries.]
Testing the VDT
Every night, we test:
- Full VDT install and subsets of the VDT
- The current release (you might be surprised how often things break after release!)
- The upcoming release
On all supported platforms; “supported” means we test it every night. The VDT works on some unsupported platforms. We care about interactions between the software.
Supported Platforms: RedHat 7, RedHat 9, RHAS 3, RHAS 4, RHAS 3/x86-64, Scientific Linux 3, Fedora Core 3, Fedora Core 4, Fedora Core 4/x86-64, Debian 3.1, ROCKS 3.3, SuSE 9/ia64
Test results on the web; results via email - a daily reminder!
Deploying the VDT
- We want to support root and non-root installations
- We want to assist with configuration
- We want it to be simple
Our solution: Pacman
- Developed by Saul Youssef (BU)
- Downloads and installs with one command
- Asks questions during install (optionally)
- Does not require root
- Can install multiple versions at the same time
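Installation then comes down to a single command; an illustrative sketch, where the cache URL and package name are placeholders rather than an exact VDT location:

```shell
# Fetch a package and its dependencies from a Pacman cache;
# runs as an ordinary user and installs into the current directory
pacman -get http://vdt.cs.wisc.edu/vdt_cache:Condor
```

Because nothing requires root, several VDT versions can live side by side in different directories.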
Challenges we struggle with
- How should we smoothly update a production service? In-place vs. on-the-side; preserve the old configuration while making big changes.
- As easy as we try to make it, it still takes hours to fully install and set up from scratch.
- How do we support more platforms? It's a struggle to keep up with the onslaught of Linux distributions. Mac OS X? Solaris?
More challenges
- Improving testing. We care about interactions between the software: “When using a VOMS proxy with Condor-G, can we run a GT4 job with GridFTP transfer, keeping the proxy in MyProxy, while using PBS as the backend batch system…”
- Some people want native packaging formats: RPM, Deb.
- What software should we have? New storage management software.
One more challenge: hiring
- We need high-quality software developers
- Creating the VDT involves all aspects of software development
- But developers prefer writing new code instead of: writing lots of little bits of code, thorough testing, lots of debugging, user support
Where do you learn more?
http://vdt.cs.wisc.edu
Support: Alain Roy: Miron Livny: [email protected]
Official Support: [email protected]
Security Infrastructure
- Identity: X.509 certificates. OSG is a founding member of the US TAGPMA.
- DOEGrids provides script utilities for bulk requests of host certs, CRL checking, etc.
- The VDT downloads CA information from the IGTF.
- Authentication and authorization using VOMS extended attribute certificates.
- DN -> account mapping done at the site (multiple CEs, SEs) by GUMS.
- Standard authorization callouts to Prima (CE) and gPlazma (SE).
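The DN-to-account mapping that GUMS performs can be illustrated with a toy sketch. This is not the GUMS implementation; the VO names, roles, DNs and accounts below are invented, and real GUMS policies are far richer:

```python
# Toy GUMS-style mapping: a certificate DN plus a VOMS role is mapped
# to a local Unix account according to per-VO policy.
VO_POLICY = {
    # (vo, role) -> a shared group account, or a pool-account prefix
    ("cms", "production"): {"type": "group", "account": "cmsprod"},
    ("cms", None):         {"type": "pool",  "prefix": "cms"},
}

POOL = {}  # DN -> pooled account, assigned on first use and then stable

def map_user(dn, vo, role):
    policy = VO_POLICY.get((vo, role))
    if policy is None:
        raise PermissionError(f"no mapping for {vo}/{role}")
    if policy["type"] == "group":
        return policy["account"]           # everyone in the role shares one account
    if dn not in POOL:                     # pool: one account per distinct DN
        POOL[dn] = f"{policy['prefix']}{len(POOL) + 1:03d}"
    return POOL[dn]

print(map_user("/DC=org/CN=Alice", "cms", "production"))  # cmsprod
print(map_user("/DC=org/CN=Bob", "cms", None))            # cms001
print(map_user("/DC=org/CN=Bob", "cms", None))            # cms001 (stable)
```

The point of the sketch is the policy split the slide describes: the VOMS attributes decide *which* policy applies, while the site-local service decides *which account* results.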
Security Infrastructure
Security process modeled on NIST procedural controls, starting from an inventory of the OSG assets:
- Management: risk assessment, planning, service auditing and checking
- Operational: incident response, awareness and training, configuration management
- Technical: authentication and revocation, auditing and analysis
End-to-end trust in the quality of code executed on remote CPUs - signatures?
User and VO Management
VO registers with the Operations Center:
- Provides URL for its VOMS service, to be propagated to the sites. Several VOMSes are shared with EGEE as part of WLCG.
User registers through VOMRS or a VO administrator:
- User is added to the VOMS of one or more VOs.
- VO is responsible for users signing the AUP.
- VO is responsible for VOMS service support.
Site registers with the Operations Center:
- Signs the service agreement.
- Decides which VOs to support (striving for default admit).
- Populates GUMS from the VOMSes of all VOs; chooses an account UID policy for each VO & role.
VOs and sites provide Support Center contact and joint operations.
For WLCG: US ATLAS and US CMS Tier-1s are directly registered to WLCG; other support centers are propagated through the OSG GOC to WLCG.
Operations and User Support
- Virtual Organization (VO): group of one or more researchers.
- Resource Provider (RP): operates Compute Elements and Storage Elements.
- Support Center (SC): provides support for one or more VOs and/or RPs.
- VO support centers: provide end-user support, including triage of user-related trouble tickets.
- Community support: volunteer effort to provide an SC for RPs and for VOs without their own SC, plus a general help discussion mail list.
Operations Model
Lines represent communication paths and, in our model, agreements.
We have not progressed very far with agreements yet.
Real support organizations often play multiple roles. Gray shading indicates that OSG Operations is composed of effort from all the support centers.
OSG Release Process: applications -> integration -> provision -> deploy.
Integration Testbed (ITB): 15-20 sites. Production (OSG): 50+ sites, including Sao Paulo, Taiwan and S. Korea.
Integration Testbed: status as reported in the GridCat status catalog (ITB release, service, facility, site, Ops map, Tier-2 site status).
Release Schedule
- OSG 0.4.0 -> OSG 0.4.1 -> OSG 0.6.0 -> OSG 0.8.0 -> OSG 1.0.0, with incremental updates (minor releases) between major releases.
[Timeline: quarterly marks 01/06 through 9/08; external milestones include SC4, CMS CSA06, ATLAS Cosmic Ray Run, WLCG Service Commissioned, and Advanced LIGO.]
OSG Release Timeline
- Production releases: OSG 0.2.1 -> OSG 0.4.0 -> OSG 0.4.1 -> OSG 0.6.0
- Integration releases: ITB 0.1.2 -> ITB 0.1.6 -> ITB 0.3.0 -> ITB 0.3.4 -> ITB 0.3.7 -> ITB 0.5.0
Deployment and Maintenance
Distribute s/w through VDT and OSG caches.
Progress technically via VDT weekly office hours - problems, help, planning - fed from multiple sources (Ops, Int, VDT-Support, mail, phone).
Publish plans and problems through VDT “To do list”, Int-Twiki and ticket systems.
Critical updates and patches follow Standard Operating Procedures.
Release Functionality
- OSG 0.6 (Fall 2006): accounting; Squid (web caching in support of s/w distribution + database information).
- OSG 0.8 (Spring 2007): VM-based Edge Services; just-in-time job scheduling; pull-mode Condor-C; SRM V2 + AuthZ; CEMon ClassAd-based resource selection; support for sites to run pilot jobs and/or glide-ins using gLExec for identity changes; support for MDS-4.
- OSG 1.0: end of 2007.
Inter-operability with Campus Grids
FermiGrid is an interesting example of the challenges we face when making the resources of a campus grid (in this case a DOE laboratory) accessible to the OSG community.
Summary
- The OSG facility opened July 22nd, 2005.
- The OSG facility is under steady use: ~2000-3000 jobs at all times; mostly HEP, but occasionally large Bio/Eng/Med; moderate other physics (Astro/Nuclear); LIGO expected to ramp up.
- The OSG project: a 5-year proposal to DOE & NSF, funded starting 9/06; Facility & Improve/Expand/Extend/Interoperate & E&O.
- Off to a running start … but lots more to do:
  - Routinely exceeding 1 Gbps at 3 sites; scale by x4 by 2008, and many more sites.
  - Routinely exceeding 1000 running jobs per client; scale by at least x10 by 2008.
  - Have reached a 99% success rate for 10,000-jobs-per-day submission; need to reach this routinely, even under heavy load.
What is FermiGrid?
- Integrates resources across most (soon all) owners at Fermilab.
- Supports jobs from Fermilab organizations running on any/all accessible campus (FermiGrid) and national (Open Science Grid) resources. Supports jobs from the OSG being scheduled onto any/all Fermilab sites.
- Unified and reliable common interface and services for the FermiGrid gateway, including security, job scheduling, user management, and storage. More information is available at http://fermigrid.fnal.gov
Job Forwarding and Resource Sharing
- The gateway currently interfaces 5 Condor pools with diverse file systems and >1000 job slots; plans to grow to 11 clusters (8 Condor, 2 PBS and 1 LSF).
- Job scheduling policies and agreements in place for sharing allow fast response to changes in resource needs by Fermilab and OSG users.
- The gateway provides a single bridge between the OSG wide-area distributed infrastructure and the FermiGrid local sites. It consists of a Globus gatekeeper and a Condor-G; each cluster has its own Globus gatekeeper.
- Storage and job execution policies are applied through site-wide managed security and authorization services.
Access to FermiGrid
[Diagram: Fermilab users, OSG general users and OSG “agreed” users submit through the FermiGrid gateway (Globus gatekeeper + Condor-G); the gateway forwards jobs to the CDF, DZero, Shared and CMS Condor pools, each behind its own Globus gatekeeper.]
GLOW: UW Enterprise Grid
• Condor pools at various departments integrated into a campus-wide grid - the Grid Laboratory of Wisconsin
• Older private Condor pools at other departments
  - ~1000 ~1 GHz Intel CPUs at CS
  - ~100 ~2 GHz Intel CPUs at Physics
  - …
• Condor jobs flock from on-campus and off-campus to GLOW
• Excellent utilization, especially when the Condor Standard Universe is used
  - Preemption, checkpointing, job migration
Grid Laboratory of Wisconsin
2003 initiative funded by NSF/UW. Six GLOW sites:
• Computational Genomics, Chemistry
• Amanda, IceCube, Physics/Space Science
• High Energy Physics/CMS, Physics
• Materials by Design, Chemical Engineering
• Radiation Therapy, Medical Physics
• Computer Science
GLOW phases 1-2 + non-GLOW funded nodes have ~1000 Xeons + 100 TB disk.
How does it work?
• Each of the six sites manages a local Condor pool with its own collector and matchmaker
• Through the High Availability Daemon (HAD) service offered by Condor, one of these matchmakers is elected to manage all GLOW resources
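The election and cross-pool sharing can be sketched as a condor_config fragment. The hostnames are invented and this is not the actual GLOW configuration, just the shape of a HAD + flocking setup:

```
# On each candidate central-manager machine: run HAD alongside the
# negotiator; HAD elects which matchmaker is active at any time
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION
HAD_USE_PRIMARY = false
HAD_LIST = cm-cs.glow.example.edu:$(HAD_PORT), \
           cm-physics.glow.example.edu:$(HAD_PORT), \
           cm-cheme.glow.example.edu:$(HAD_PORT)

# On submit machines: idle jobs may also flock to the other campus pools
FLOCK_TO = cm-physics.glow.example.edu, cm-cheme.glow.example.edu
```

The design choice is that each department keeps full control of its own pool, while the elected matchmaker and flocking provide a single campus-wide resource view.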
GLOW Deployment
• GLOW is fully commissioned and is in constant use
  - CPU:
    • 66 GLOW + 50 ATLAS + 108 other nodes @ CS
    • 74 GLOW + 66 CMS nodes @ Physics
    • 93 GLOW nodes @ ChemE
    • 66 GLOW nodes @ LMCG, MedPhys, Physics
    • 95 GLOW nodes @ MedPhys
    • 60 GLOW nodes @ IceCube
    • Total CPU: ~1339
  - Storage:
    • Head nodes at all sites
    • 45 TB each @ CS and Physics
    • Total storage: ~100 TB
• GLOW resources are used at the 100% level - the key is to have multiple user groups
• GLOW continues to grow
GLOW Usage
• GLOW nodes are always running hot!
  - CS + guests: largest user; serving guests - many cycles delivered to guests!
  - ChemE: largest community
  - HEP/CMS: production for the collaboration; production and analysis for local physicists
  - LMCG: Standard Universe
  - Medical Physics: MPI jobs
  - IceCube: simulations
GLOW Usage 3/04 - 9/05
• Leftover cycles available for “others”, taking advantage of “shadow” jobs and checkpointing jobs
• Over 7.6 million CPU-hours (865 CPU-years) served!
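The CPU-hours-to-CPU-years conversion checks out; a quick sketch (the slide's 865 presumably comes from the exact, slightly smaller hour count):

```python
# Convert served CPU-hours to CPU-years (365.25-day year)
cpu_hours = 7.6e6
cpu_years = cpu_hours / (24 * 365.25)
print(round(cpu_years))  # ~867, consistent with the slide's ~865
```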
Example Uses
• ATLAS: over 15 million proton collision events simulated, at 10 minutes each
• CMS: over 70 million events simulated, reconstructed and analyzed (total ~10 minutes per event) in the past year
• IceCube / Amanda: data filtering used 12 years of GLOW CPU in one month
• Computational Genomics: Prof. Shwartz asserts that GLOW has opened up a new paradigm of work patterns in his group - they no longer think about how long a particular computational job will take, they just do it
• Chemical Engineering: students do not know where the computing cycles are coming from - they just do it; largest user group
Open Science Grid & GLOW
• OSG jobs can run on GLOW
  - The gatekeeper routes jobs to the local Condor cluster
  - Jobs flock campus-wide, including to the GLOW resources
  - The dCache storage pool is also a registered OSG storage resource
  - Beginning to see some use
• Now actively working on rerouting GLOW jobs to the rest of OSG
  - Users do NOT have to adapt to the OSG interface and separately manage their OSG jobs
  - New Condor code development
Elevating from GLOW to OSG
Specialized scheduler operating on schedd’s jobs.
[Diagram: a “schedd on the side” watches the schedd's job queue (Job 1 … Job 5) and transforms selected jobs (e.g. Job 4 -> Job 4*) for routing to the grid.]
www.cs.wisc.edu/~miron
The Grid Universe
[Diagram: the schedd sends grid-universe jobs through a gatekeeper to startds at site X, alongside locally run vanilla jobs.]
• Easier to live with private networks
• May use non-Condor resources
• Restricted Condor feature set (e.g. no standard universe over grid)
• Must pre-allocate jobs between vanilla and grid universe
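A grid-universe job is written in the same submit language as a vanilla job; a minimal Condor-G sketch, with the gatekeeper hostname and executable name invented:

```
# Hypothetical submit file: run a job through a remote GRAM gatekeeper
universe      = grid
grid_resource = gt2 gatekeeper.example.edu/jobmanager-condor
executable    = random_seed_sim
output        = seed.$(Cluster).out
log           = seed.log
queue
```

The pre-allocation problem on the slide follows directly: the `universe` line is fixed at submit time, so the user must decide up front which jobs go to the grid.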
Dynamic Routing Jobs
• Dynamic allocation of jobs between vanilla and grid universes.
• Not every job is appropriate for transformation into a grid job.
[Diagram: the “schedd on the side” takes vanilla jobs from the schedd's queue and routes them through gatekeepers to sites X, Y and Z, while local startds continue running vanilla jobs.]
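This “schedd on the side” later shipped as the HTCondor JobRouter; a rough configuration sketch of the same idea, with site and hostnames invented and not taken from the talk:

```
# condor_config fragment: run the router and give it one target site
DAEMON_LIST = $(DAEMON_LIST), JOB_ROUTER

# Each entry turns matching vanilla jobs into grid-universe copies
JOB_ROUTER_ENTRIES = \
  [ name = "Site_X"; \
    GridResource = "gt2 gatekeeper-x.example.edu/jobmanager-condor"; \
    MaxIdleJobs = 10; ]

# Route only jobs whose owners opted in
JOB_ROUTER_SOURCE_JOB_CONSTRAINT = target.WantJobRouter is true
```

The routed copy tracks the original: if the grid copy completes, the original is marked done; if routing fails, the job stays eligible to run locally in the vanilla universe.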
Final Observation
A production grid is the product of a complex interplay of many forces: resource providers, users, software providers, hardware trends, commercial offerings, funding agencies, and the culture of all parties involved …