OSG Integration - Dipartimento di Matematica e Applicazioni

The Open Science Grid

Miron Livny, OSG Facility Coordinator, University of Wisconsin-Madison

Some History and background …

U.S. “Trillium” Grid Partnership

Trillium = PPDG + GriPhyN + iVDGL
- Particle Physics Data Grid (PPDG): $18M (DOE) (1999 – 2006)
- GriPhyN: $12M (NSF) (2000 – 2005)
- iVDGL: $14M (NSF) (2001 – 2006)

Basic composition (~150 people)
- PPDG: 4 universities, 6 labs
- GriPhyN: 12 universities, SDSC, 3 labs
- iVDGL: 18 universities, SDSC, 4 labs, foreign partners
- Experiments: BaBar, D0, STAR, JLab, CMS, ATLAS, LIGO, SDSS/NVO

Complementarity of projects
- GriPhyN: CS research, Virtual Data Toolkit (VDT) development
- PPDG: “end to end” Grid services, monitoring, analysis
- iVDGL: Grid laboratory deployment using the VDT
- Experiments provide frontier challenges
- Unified entity when collaborating internationally

From Grid3 to OSG

[Timeline: Grid3 → OSG 0.2.1 → OSG 0.4.0 → OSG 0.4.1 → OSG 0.6.0]

What is OSG?

The Open Science Grid is a US national distributed computing facility that supports scientific computing via an open collaboration of science researchers, software developers and computing, storage and network providers. The OSG Consortium is building and operating the OSG, bringing resources and researchers from universities and national laboratories together and cooperating with other national and international infrastructures to give scientists from a broad range of disciplines access to shared resources worldwide.

The OSG Project
- Co-funded by DOE and NSF at an annual rate of ~$6M for 5 years, starting FY-07.
- Currently the main stakeholders are from physics: the US LHC experiments, LIGO, the STAR experiment, the Tevatron Run II and astrophysics experiments.
- A mix of DOE-lab and campus resources.
- An active “engagement” effort to add new domains and resource providers to the OSG Consortium.

OSG Consortium

OSG Project Execution
- OSG PI: Miron Livny
- Resources Managers: Paul Avery, Albert Lazzarini
- Executive Director: Ruth Pordes
- Deputy Executive Directors: Rob Gardner, Doug Olson
- Facility Coordinator: Miron Livny
- Applications Coordinators: Torre Wenaus, Frank Würthwein
- Education, Training, Outreach Coordinator: Mike Wilde
- Security Officer: Don Petravick
- Operations Coordinator: Leigh Grundhoefer
- Engagement Coordinator: Alan Blatecky
- Software Coordinator: Alain Roy

[Org chart: checkmarked roles include provision of middleware; the OSG Executive Board and external projects also appear on the chart.]

OSG Principles

Characteristics:
- Provide guaranteed and opportunistic access to shared resources.
- Operate a heterogeneous environment, both in the services available at any site and for any VO, with multiple implementations behind common interfaces.
- Interface to campus and regional grids.
- Federate with other national and international grids.
- Support multiple software releases at any one time.

Drivers:
- Delivery to the schedule, capacity and capability of LHC and LIGO: contributions to/from and collaboration with the US ATLAS, US CMS and LIGO software and computing programs.
- Support for and collaboration with other physics and non-physics communities.
- Partnerships with other grids, especially EGEE and TeraGrid.
- Evolution by deployment of externally developed new services and technologies.

Grid of Grids - from Local to Global
[Diagram: National, Campus and Community grids]

Who are you?
- A resource can be accessed by a user via the campus, community or national grid.
- A user can access a resource with a campus, community or national grid identity.

OSG sites
[Map of OSG sites]

[Plot: running (and monitored) “OSG jobs” in 06/06]

Example GADU run in 04/06

CMS Experiment - an exemplar community grid
[Map: CMS sites on OSG (Purdue, Wisconsin, Caltech, Florida, UNL, MIT, UCSD in the USA) and on EGEE (CERN, Taiwan, Italy, UK, Germany, France)]

Data & jobs move locally, regionally & globally within the CMS grid, transparently across grid boundaries from campus to global.

The CMS Grid of Grids
- Job submission: 16,000 jobs per day submitted across EGEE & OSG via the INFN Resource Broker (RB).
- Data transfer: peak I/O of 5 Gbps from FNAL to 32 EGEE and 7 OSG sites.
- All 7 OSG sites have reached the 5 TB/day goal.
- 3 OSG sites (Caltech, Florida, UCSD) exceeded 10 TB/day.

CMS Xfer on OSG
All sites have exceeded 5 TB per day in June.

CMS Xfer FNAL to World
The US CMS center at FNAL transfers data to 39 sites worldwide in the CMS global transfer challenge. Peak transfer rates of ~5 Gbps are reached.

EGEE–OSG inter-operability
- Agree on a common Virtual Organization Management System (VOMS).
- Active joint security groups, leading to common policies and procedures.
- Condor-G interfaces to multiple remote job execution services (GRAM, Condor-C); a submission sketch follows below.
- File transfers using GridFTP.
- SRM V1.1 for managed storage access; SRM V2.1 in test.
- Publish the OSG BDII to a shared BDII so that Resource Brokers can route jobs across the two grids.
- Automate ticket routing between GOCs.
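To make the Condor-G bullet concrete, here is a minimal sketch of submitting a grid-universe job through Condor-G to a GRAM gatekeeper. The gatekeeper host, executable and file names are hypothetical placeholders, and `gt2` is just one of the grid types Condor-G speaks; this illustrates the mechanism, not an actual OSG or EGEE endpoint.

```python
# Sketch: hand a grid-universe job to Condor-G, which forwards it to a remote
# GRAM gatekeeper. Gatekeeper, executable and file names are placeholders;
# a local Condor installation with condor_submit on PATH is assumed.
import subprocess
import tempfile
import textwrap

GATEKEEPER = "ce.example.edu/jobmanager-condor"   # hypothetical compute element

submit_description = textwrap.dedent(f"""\
    universe      = grid
    grid_resource = gt2 {GATEKEEPER}
    executable    = analyze.sh
    transfer_input_files = input.dat
    output        = job.out
    error         = job.err
    log           = job.log
    queue
""")

def submit(description: str) -> None:
    """Write the submit description to a file and pass it to condor_submit."""
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(description)
        path = f.name
    subprocess.run(["condor_submit", path], check=True)

if __name__ == "__main__":
    print(submit_description)      # show what would be submitted
    # submit(submit_description)   # uncomment on a host with Condor-G configured
```

The same kind of submit description, pointed at a different GRAM or Condor-C endpoint, is what lets one Condor-G client drive jobs on either grid.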

OSG Middleware Layering
- VO-specific services & frameworks: LIGO Data Grid; CMS services & framework; ATLAS services & framework; CDF and D0 SAMGrid & framework; …
- OSG Release Cache: VDT + configuration, validation, VO management.
- Virtual Data Toolkit (VDT) common services: NMI + VOMS, CEMon (common EGEE components), MonALISA, Clarens, AuthZ.
- NSF Middleware Initiative (NMI): Condor, Globus, MyProxy.

OSG Middleware Pipeline
- Domain science requirements.
- Condor, Globus, EGEE etc.; OSG stakeholder and middleware developer (joint) projects.
- Test on a “VO-specific grid”.
- Integrate into a VDT release.
- Deploy on the OSG integration grid; test interoperability with EGEE and TeraGrid.
- Provision in an OSG release & deploy to OSG production.

The Virtual Data Toolkit

Alain Roy, OSG Software Coordinator, Condor Team, University of Wisconsin-Madison

What is the VDT?

- A collection of software:
  - Grid software (Condor, Globus and lots more)
  - the Virtual Data System (the origin of the name “VDT”)
  - utilities
- An easy installation:
  - Goal: push a button, everything just works.
  - Two methods:
    - Pacman: installs and configures it all.
    - RPM: installs some of the software, no configuration.
- A support infrastructure.

How much software?

Who makes the VDT?

- The VDT is a product of the Open Science Grid (OSG); the VDT is used on all OSG grid sites.
- OSG is new, but the VDT has been around since 2002.
- Originally, the VDT was a product of GriPhyN/iVDGL; the VDT was used on all Grid2003 sites.

Who makes the VDT?

1 mastermind + 3 FTEs: Miron Livny, Alain Roy, Tim Cartwright, Andy Pavlo.

Who uses the VDT?

- Open Science Grid
- LIGO Data Grid
- LCG (the LHC Computing Grid, from CERN)
- EGEE (Enabling Grids for E-sciencE)

Why should you care?

- The VDT gives insight into the technical challenges of building a large grid:
  - What software do you need?
  - How do you build it?
  - How do you test it?
  - How do you deploy it?
  - How do you support it?

What software is in the VDT?

(This talk is more about these challenges than about the specific software.)

- Job management: Condor (including Condor-G & Condor-C)
- Data management: GridFTP (data transfer), RLS (replica location), DRM (storage management), Globus RFT
- Security: VOMS (VO membership), GUMS (local authorization), MyProxy (proxy management), GSI SSH, CA CRL updater
- Information services: Globus MDS, GLUE schema & providers
- Monitoring: MonALISA, gLite CEMon
- Accounting: OSG Gratia

What software is in the VDT?

- Client tools: Virtual Data System, SRM clients (V1 and V2), UberFTP (GridFTP client)
- Developer tools: PyGlobus, PyGridWare
- Testing: NMI Build & Test, VDT tests
- Support software: Apache Tomcat, MySQL (with MyODBC), non-standard Perl modules, Wget, Squid, Logrotate, configuration scripts
- And more!

Building the VDT
- We distribute binaries: expecting everyone to build from source is impractical.
- It is essential to be able to build on many platforms, and to replicate builds.
- We build all binaries with the NMI Build and Test infrastructure.

Building the VDT
[Build pipeline diagram: sources (CVS) and contributed binaries are patched and built on the NMI Build & Test Condor pool (70+ computers), then packaged and tested, and published to users as RPM downloads and a Pacman cache.]

Testing the VDT
- Every night, we test:
  - a full VDT install,
  - subsets of the VDT,
  - the current release (you might be surprised how often things break after release!),
  - the upcoming release,
  - on all supported platforms.
- “Supported” means “we test it every night”; the VDT works on some unsupported platforms.
- We care about the interactions between the software.
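As a toy picture of what “every night, on all supported platforms” implies, the sketch below enumerates a (release × install flavor × platform) test matrix. The lists are illustrative stand-ins, not the actual VDT test configuration.

```python
# Toy nightly test matrix: every (release, install flavor, platform) combination
# gets a run. The lists here are illustrative, not the real VDT configuration.
from itertools import product

releases  = ["current", "upcoming"]
installs  = ["full VDT", "client subset", "server subset"]
platforms = ["RedHat 9", "RHAS 4", "Scientific Linux 3", "Debian 3.1"]

def run_tests(release: str, install: str, platform: str) -> bool:
    """Stand-in for: install, configure, exercise the services, record results."""
    return True  # pretend tonight's run passed

if __name__ == "__main__":
    results = {(r, i, p): run_tests(r, i, p)
               for r, i, p in product(releases, installs, platforms)}
    failures = [combo for combo, ok in results.items() if not ok]
    print(f"{len(results)} combinations tested, {len(failures)} failures")
```

Even this toy matrix is dozens of combinations a night, which is why the daily results email mentioned two slides below matters.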

Supported Platforms
- RedHat 7, RedHat 9
- RHAS 3, RHAS 4, RHAS 3/x86-64
- Scientific Linux 3
- Fedora Core 3, Fedora Core 4, Fedora Core 4/x86-64
- Debian 3.1
- ROCKS 3.3
- SuSE 9/ia64

Tests
- Results on the web.
- Results via email: a daily reminder!

Deploying the VDT
- We want to support root and non-root installations.
- We want to assist with configuration.
- We want it to be simple.
- Our solution: Pacman
  - Developed by Saul Youssef (BU).
  - Downloads and installs with one command.
  - Asks questions during the install (optionally).
  - Does not require root.
  - Can install multiple versions at the same time.
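As a rough illustration of the “one command” install, here is a sketch that drives Pacman from a script for a non-root install. The cache URL and package name are placeholders (the real cache paths change from release to release), and this is Saul Youssef's Pacman, not the Arch Linux package manager.

```python
# Sketch of a scripted, non-root VDT install via Pacman. Cache URL and package
# name are placeholders; Pacman (the grid installer) is assumed to be on PATH.
# Pacman installs into the current working directory, so we run it from there.
import os
import subprocess

INSTALL_DIR = os.path.expanduser("~/vdt")          # non-root install location
CACHE       = "http://vdt.cs.wisc.edu/vdt_cache"   # placeholder cache URL
PACKAGE     = "VDT-Client"                         # placeholder package name

if __name__ == "__main__":
    os.makedirs(INSTALL_DIR, exist_ok=True)
    subprocess.run(["pacman", "-get", f"{CACHE}:{PACKAGE}"],
                   cwd=INSTALL_DIR, check=True)
    print(f"Installed {PACKAGE} from {CACHE} into {INSTALL_DIR}")
```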

Challenges we struggle with
- How should we smoothly update a production service?
  - In-place vs. on-the-side.
  - Preserve the old configuration while making big changes.
  - As easy as we try to make it, it still takes hours to fully install and set up from scratch.
- How do we support more platforms?
  - It’s a struggle to keep up with the onslaught of Linux distributions.
  - Mac OS X? Solaris?

More challenges
- Improving testing. We care about the interactions between the software: “When using a VOMS proxy with Condor-G, can we run a GT4 job with GridFTP transfer, keeping the proxy in MyProxy, while using PBS as the backend batch system…”
- Some people want native packaging formats: RPM, Deb.
- What software should we have? New storage management software.

One more challenge
- Hiring: we need high-quality software developers.
- Creating the VDT involves all aspects of software development.
- But developers prefer writing new code instead of:
  - writing lots of little bits of code,
  - thorough testing,
  - lots of debugging,
  - user support.

Where do you learn more?

 http://vdt.cs.wisc.edu

Support:
- Alain Roy: [email protected]
- Miron Livny: [email protected]
- Official support: [email protected]

Security Infrastructure
- Identity: X.509 certificates.
- OSG is a founding member of the US TAGPMA.
- DOEGrids provides script utilities for bulk requests of host certs, CRL checking, etc.
- The VDT downloads CA information from the IGTF.
- Authentication and authorization using VOMS extended attribute certificates.
- DN -> account mapping is done at the site (multiple CEs, SEs) by GUMS (a toy mapping sketch follows below).
- Standard authorization callouts to Prima (CE) and gPlazma (SE).
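The GUMS step above can be pictured as a site-controlled lookup from the (VO, role) carried in the VOMS attributes to a local account policy. The sketch below is a toy model with invented VO names, roles and accounts; real GUMS is an XML-configured service, and sites choose group accounts, role accounts or pool accounts per VO.

```python
# Toy model of GUMS-style DN -> local account mapping. Policies and names are
# invented for illustration; real GUMS is configured per site in XML.
from dataclasses import dataclass

@dataclass
class GridIdentity:
    dn: str          # certificate subject DN
    vo: str          # VO asserted by the VOMS attribute certificate
    role: str = ""   # optional VOMS role

# Per-(VO, role) account policy chosen by the site.
ACCOUNT_POLICY = {
    ("cms", ""):            "cmsuser",    # group account
    ("cms", "production"):  "cmsprod",    # role-based account
    ("ligo", ""):           "ligo-pool",  # pool-account prefix
}

def map_to_account(ident: GridIdentity) -> str:
    """Return the local UNIX account this grid identity is mapped to."""
    try:
        return ACCOUNT_POLICY[(ident.vo, ident.role)]
    except KeyError:
        raise PermissionError(f"no mapping for VO={ident.vo!r} role={ident.role!r}")

if __name__ == "__main__":
    user = GridIdentity(dn="/DC=org/DC=doegrids/OU=People/CN=Some User",
                        vo="cms", role="production")
    print(map_to_account(user))   # -> cmsprod
```

Roughly speaking, the Prima (CE) and gPlazma (SE) callouts ask GUMS this same question at job or transfer time.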

Security Infrastructure
- The security process is modeled on NIST procedural controls, starting from an inventory of the OSG assets:
  - Management: risk assessment, planning, service auditing and checking.
  - Operational: incident response, awareness and training, configuration management.
  - Technical: authentication and revocation, auditing and analysis.
- End-to-end trust in the quality of code executed on a remote CPU: signatures?

User and VO Management
- A VO registers with the Operations Center:
  - Provides the URL of its VOMS service, to be propagated to the sites.
  - Several VOMSes are shared with EGEE as part of WLCG.
- A user registers through VOMRS or a VO administrator:
  - The user is added to the VOMS of one or more VOs.
  - The VO is responsible for having its users sign the AUP.
  - The VO is responsible for VOMS service support.
- A site registers with the Operations Center:
  - Signs the service agreement.
  - Decides which VOs to support (striving for default admit).
  - Populates GUMS from the VOMSes of all supported VOs.
  - Chooses the account UID policy for each VO & role.
- VOs and sites provide Support Center contacts and joint operations.
- For WLCG: the US ATLAS and US CMS Tier-1s are directly registered to WLCG; other support centers are propagated through the OSG GOC to WLCG.

Operations and User Support
- Virtual Organization (VO): a group of one or more researchers.
- Resource Provider (RP): operates Compute Elements and Storage Elements.
- Support Center (SC): provides support for one or more VOs and/or RPs.
- VO support centers: provide end-user support, including triage of user-related trouble tickets.
- Community Support: a volunteer effort to provide an SC for RPs and VOs without their own SC, plus a general help discussion mailing list.

Operations Model

Lines represent communication paths and, in our model, agreements. We have not progressed very far with agreements yet.

Real support organizations often play multiple roles. Gray shading indicates that OSG Operations is composed of effort from all the support centers.

OSG Release Process
- Applications → Integration → Provision → Deploy.
- Integration Testbed (ITB): 15–20 sites. Production (OSG): 50+ sites, including São Paulo, Taiwan and S. Korea.

Integration Testbed
[GridCat status catalog screenshot: ITB release, service/facility/site status, operations map, Tier-2 site status]

Release Schedule
[Timeline: OSG 0.4.0 (01/06) → OSG 0.4.1 (03/06) → OSG 0.6.0 → OSG 0.8.0 → OSG 1.0.0, with incremental (minor release) updates in between; aligned against SC4, CMS CSA06, the ATLAS Cosmic Ray Run, WLCG service commissioning and Advanced LIGO milestones over 2006–2008.]

OSG Release Timeline
[Timeline: production releases OSG 0.2.1, 0.4.0, 0.4.1 and 0.6.0, each preceded on the integration grid by ITB releases 0.1.2, 0.1.6, 0.3.0, 0.3.4, 0.3.7 and 0.5.0.]

Deployment and Maintenance
- Distribute software through the VDT and OSG caches.
- Progress technically via weekly VDT office hours (problems, help, planning), fed from multiple sources (Ops, Integration, VDT support, mail, phone).
- Publish plans and problems through the VDT “to do” list, the Integration Twiki and the ticket systems.
- Critical updates and patches follow Standard Operating Procedures.

Release Functionality
- OSG 0.6 (Fall 2006):
  - Accounting.
  - Squid (web caching in support of software distribution + database information).
- OSG 0.8 (Spring 2007):
  - VM-based Edge Services.
  - Just-in-time job scheduling; pull-mode Condor-C.
  - SRM V2 + AuthZ.
  - CEMon ClassAd-based Resource Selection.
  - Support for sites to run pilot jobs and/or glide-ins using gLexec for identity changes.
  - Support for MDS-4.
- OSG 1.0: end of 2007.

Inter-operability with Campus grids
FermiGrid is an interesting example of the challenges we face when making the resources of a campus grid (in this case a DOE laboratory) accessible to the OSG community.

OSG Principles

Characteristics:
- Provide guaranteed and opportunistic access to shared resources.
- Operate a heterogeneous environment, both in the services available at any site and for any VO, with multiple implementations behind common interfaces.
- Interface to campus and regional grids.
- Federate with other national and international grids.
- Support multiple software releases at any one time.

Drivers:
- Delivery to the schedule, capacity and capability of LHC and LIGO: contributions to/from and collaboration with the US ATLAS, US CMS and LIGO software and computing programs.
- Support for and collaboration with other physics and non-physics communities.
- Partnerships with other grids, especially EGEE and TeraGrid.
- Evolution by deployment of externally developed new services and technologies.

OSG Middleware Layering
- VO-specific services & frameworks: LIGO Data Grid; CMS services & framework; ATLAS services & framework; CDF and D0 SAMGrid & framework; …
- OSG Release Cache: VDT + configuration, validation, VO management.
- Virtual Data Toolkit (VDT) common services: NMI + VOMS, CEMon (common EGEE components), MonALISA, Clarens, AuthZ.
- NSF Middleware Initiative (NMI): Condor, Globus, MyProxy.

Summary
- The OSG facility opened July 22nd, 2005.
- The OSG facility is under steady use:
  - ~2,000–3,000 jobs at all times.
  - Mostly HEP, but occasionally large Bio/Eng/Med use.
  - Moderate other physics (Astro/Nuclear); LIGO expected to ramp up.
- The OSG project:
  - A 5-year proposal to DOE & NSF, funded starting 9/06.
  - Facility & Improve/Expand/Extend/Interoperate & E&O.
- Off to a running start … but lots more to do:
  - Routinely exceeding 1 Gbps at 3 sites; scale by x4 by 2008, and at many more sites.
  - Routinely exceeding 1,000 running jobs per client; scale by at least x10 by 2008.
  - Have reached a 99% success rate for 10,000 jobs per day submitted; need to reach this routinely, even under heavy load.

EGEE–OSG inter-operability
- Agree on a common Virtual Organization Management System (VOMS).
- Active joint security groups, leading to common policies and procedures.
- Condor-G interfaces to multiple remote job execution services (GRAM, Condor-C).
- File transfers using GridFTP.
- SRM V1.1 for managed storage access; SRM V2.1 in test.
- Publish the OSG BDII to a shared BDII so that Resource Brokers can route jobs across the two grids.
- Automate ticket routing between GOCs.

What is FermiGrid?

- Integrates resources across most (soon all) owners at Fermilab.
- Supports jobs from Fermilab organizations to run on any/all accessible campus (FermiGrid) and national (Open Science Grid) resources.
- Supports jobs from the OSG to be scheduled onto any/all Fermilab sites.
- Unified and reliable common interface and services for the FermiGrid gateway, including security, job scheduling, user management, and storage.
- More information is available at http://fermigrid.fnal.gov

Job Forwarding and Resource Sharing
- The gateway currently interfaces 5 Condor pools with diverse file systems and >1,000 job slots; plans are to grow to 11 clusters (8 Condor, 2 PBS and 1 LSF).
- Job scheduling policies and in-place sharing agreements allow fast response to changes in the resource needs of Fermilab and OSG users.
- The gateway provides the single bridge between the OSG wide-area distributed infrastructure and the FermiGrid local sites. It consists of a Globus gatekeeper and a Condor-G; each cluster has its own Globus gatekeeper.
- Storage and job execution policies are applied through site-wide managed security and authorization services.
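The bridging role of the gateway can be pictured as: accept a job at the single OSG-facing gatekeeper, pick a local cluster that admits the job's VO and has free slots, and re-submit to that cluster's own gatekeeper via Condor-G. The sketch below is a toy model with invented cluster names, slot counts and VO policies; it is not FermiGrid's actual scheduling logic.

```python
# Toy model of the FermiGrid gateway: one entry point forwarding jobs to one of
# several local clusters, each behind its own gatekeeper. Names, slot counts
# and VO policies are invented; the real gateway uses Condor-G and site policy.
from dataclasses import dataclass, field

@dataclass
class Cluster:
    name: str
    gatekeeper: str
    free_slots: int
    allowed_vos: set = field(default_factory=set)

CLUSTERS = [
    Cluster("cdf",    "cdf-gk.example.gov",    free_slots=0,   allowed_vos={"cdf"}),
    Cluster("dzero",  "d0-gk.example.gov",     free_slots=40,  allowed_vos={"dzero"}),
    Cluster("cms",    "cms-gk.example.gov",    free_slots=200, allowed_vos={"cms"}),
    Cluster("shared", "shared-gk.example.gov", free_slots=120, allowed_vos={"cms", "dzero", "osg"}),
]

def forward(vo: str) -> str:
    """Pick the cluster with the most free slots that admits this VO."""
    candidates = [c for c in CLUSTERS if vo in c.allowed_vos and c.free_slots > 0]
    if not candidates:
        raise RuntimeError(f"no cluster currently accepts VO {vo!r}")
    best = max(candidates, key=lambda c: c.free_slots)
    return best.gatekeeper   # the gateway would re-submit via Condor-G to this GK

if __name__ == "__main__":
    print(forward("osg"))    # an opportunistic OSG job lands on the shared pool
```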

Access to FermiGrid

[Diagram: Fermilab users, OSG general users and OSG “agreed” users all enter through the FermiGrid gateway (Globus gatekeeper + Condor-G); the gateway forwards jobs to the CDF, DZero, CMS and shared Condor pools, each behind its own Globus gatekeeper.]

GLOW: UW Enterprise Grid

• Condor pools at various departments, integrated into a campus-wide grid: the Grid Laboratory of Wisconsin (GLOW)
• Older private Condor pools at other departments
  – ~1000 ~1 GHz Intel CPUs at CS
  – ~100 ~2 GHz Intel CPUs at Physics
  – …
• Condor jobs flock from on-campus and off-campus to GLOW (a minimal sketch of the flocking order follows below)
• Excellent utilization
  – Especially when the Condor Standard Universe is used
  – Preemption, checkpointing, job migration
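Flocking means that a job which finds no match in its home pool is offered to a list of friendly pools in order. The sketch below is a toy model of that fallback with invented pool names and slot counts; in a real Condor setup the list lives in configuration (the FLOCK_TO setting), not in user code.

```python
# Toy model of Condor flocking: try the home pool first, then each friendly
# pool in order until one has an idle slot. Pool names and slot counts are
# invented; real flocking is configured in Condor, not coded by users.
HOME_POOL = "physics.example.edu"
FLOCK_TO  = ["cs.example.edu", "glow.example.edu"]   # tried in this order

IDLE_SLOTS = {                 # pretend snapshot of idle slots per pool
    "physics.example.edu": 0,
    "cs.example.edu": 0,
    "glow.example.edu": 350,
}

def place_job() -> str:
    """Return the pool that will run the job, following the flocking order."""
    for pool in [HOME_POOL] + FLOCK_TO:
        if IDLE_SLOTS.get(pool, 0) > 0:
            return pool
    return "stay queued at home"

if __name__ == "__main__":
    print(place_job())   # -> glow.example.edu in this snapshot
```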

Grid Laboratory of Wisconsin

• A 2003 initiative funded by NSF/UW
• Six GLOW sites
  – Computational Genomics, Chemistry
  – Amanda, IceCube, Physics/Space Science
  – High Energy Physics/CMS, Physics
  – Materials by Design, Chemical Engineering
  – Radiation Therapy, Medical Physics
  – Computer Science
• GLOW phases 1 and 2 plus non-GLOW-funded nodes have ~1000 Xeons + 100 TB of disk

How does it work?

• Each of the six sites manages a local Condor pool with its own collector and matchmaker
• Through the High Availability Daemon (HAD) service offered by Condor, one of these matchmakers is elected to manage all GLOW resources (a toy election sketch follows below)
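Condor's HAD keeps several matchmakers configured but lets only one be active at a time. The sketch below is a toy stand-in for that election: pick the highest-priority site whose matchmaker is alive. It is illustrative only; real HAD is configured inside the Condor daemons and its election protocol is more involved.

```python
# Toy stand-in for HAD-style matchmaker election: of the six GLOW sites, the
# highest-priority one whose matchmaker is alive manages all resources. The
# liveness flags are invented; real Condor HAD handles this in the daemons.
SITES = [  # (priority, site, matchmaker alive?)
    (1, "cs",      True),
    (2, "physics", True),
    (3, "cheme",   False),
    (4, "medphys", True),
    (5, "lmcg",    True),
    (6, "icecube", True),
]

def elect_matchmaker() -> str:
    """Return the site whose matchmaker currently manages all GLOW resources."""
    alive = [(prio, site) for prio, site, ok in SITES if ok]
    if not alive:
        raise RuntimeError("no matchmaker available")
    return min(alive)[1]   # lowest priority number wins

if __name__ == "__main__":
    print(elect_matchmaker())   # -> "cs" while its matchmaker is up
```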

GLOW Deployment

• GLOW is fully commissioned and is in constant use
• CPU
  – 66 GLOW + 50 ATLAS + 108 other nodes @ CS
  – 74 GLOW + 66 CMS nodes @ Physics
  – 93 GLOW nodes @ ChemE
  – 66 GLOW nodes @ LMCG, MedPhys, Physics
  – 95 GLOW nodes @ MedPhys
  – 60 GLOW nodes @ IceCube
  – Total CPU: ~1339
• Storage
  – Head nodes at all sites
  – 45 TB each @ CS and Physics
  – Total storage: ~100 TB
• GLOW resources are used at the 100% level
  – The key is to have multiple user groups
• GLOW continues to grow

GLOW Usage

• GLOW nodes are always running hot!
  – CS + guests: the largest user; serving guests, many cycles delivered to guests!
  – ChemE: the largest community
  – HEP/CMS: production for the collaboration; production and analysis by local physicists
  – LMCG: Standard Universe
  – Medical Physics: MPI jobs
  – IceCube: simulations

GLOW Usage 3/04 – 9/05

• Leftover cycles available for “others”
• Takes advantage of “shadow” jobs
• Takes advantage of checkpointing jobs
• Over 7.6 million CPU-hours (865 CPU-years) served!

Example Uses

• ATLAS
  – Over 15 million proton collision events simulated, at 10 minutes each
• CMS
  – Over 70 million events simulated, reconstructed and analyzed (~10 minutes per event in total) in the past year
• IceCube / Amanda
  – Data filtering used 12 years of GLOW CPU in one month
• Computational Genomics
  – Prof. Shwartz asserts that GLOW has opened up a new paradigm of work patterns in his group: they no longer think about how long a particular computational job will take, they just do it
• Chemical Engineering
  – Students do not know where the computing cycles are coming from; they just do it; the largest user group

Open Science Grid & GLOW

• OSG jobs can run on GLOW
  – The gatekeeper routes jobs to the local Condor cluster
  – Jobs flock campus-wide, including to the GLOW resources
  – The dCache storage pool is also a registered OSG storage resource
  – Beginning to see some use
• Now actively working on rerouting GLOW jobs to the rest of OSG
  – Users do NOT have to adapt to the OSG interface and separately manage their OSG jobs
  – New Condor code development

Elevating from GLOW to OSG

A specialized scheduler operating on the schedd’s jobs.

[Diagram: the schedd’s job queue (Job 1, Job 2, Job 3, Job 4, Job 5, …); a “Schedd On The Side” watches the queue and creates a transformed copy, Job 4*.]

www.cs.wisc.edu/~miron
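The “schedd on the side” (an idea along the lines of what became Condor's JobRouter) watches the ordinary job queue and, for a selected job, creates a transformed copy aimed at the grid: Job 4 becomes Job 4*, while the original stays in the queue. The sketch below is a toy of that copy-and-transform step with an invented job representation and gatekeeper name; it is not the actual Condor implementation.

```python
# Toy model of the "schedd on the side": take a vanilla-universe job from the
# queue and create a transformed copy (Job 4 -> Job 4*) targeted at a grid
# site. The job representation and site name are invented for illustration.
import copy

queue = [
    {"id": "Job 1", "universe": "vanilla", "executable": "sim.sh"},
    {"id": "Job 4", "universe": "vanilla", "executable": "sim.sh"},
]

def elevate(job: dict, gatekeeper: str) -> dict:
    """Return a grid-universe copy of a vanilla job, marked as its '*' twin."""
    routed = copy.deepcopy(job)
    routed["id"] += "*"                            # Job 4 -> Job 4*
    routed["universe"] = "grid"
    routed["grid_resource"] = f"gt2 {gatekeeper}"  # hypothetical remote CE
    return routed

if __name__ == "__main__":
    twin = elevate(queue[1], "osg-ce.example.edu/jobmanager-condor")
    print(twin)   # the original Job 4 stays in the queue as a fallback
```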

The Grid Universe

[Diagram: the schedd runs some jobs as vanilla-universe jobs on local startds and sends others, via a gatekeeper, to site X.]
• Easier to live with private networks
• May use non-Condor resources
• Restricted Condor feature set (e.g. no standard universe over grid)
• Must pre-allocate jobs between the vanilla and grid universes

www.cs.wisc.edu/~miron

Dynamic Routing Jobs

• Dynamic allocation of jobs between the vanilla and grid universes
• Not every job is appropriate for transformation into a grid job (a toy routing check follows below)

[Diagram: the schedd and a “Schedd On The Side” route jobs either to local startds (vanilla universe) or through gatekeepers to sites X, Y and Z (grid universe).]

www.cs.wisc.edu/~miron
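The key point of the slide is that the decision is made per job, at run time. The sketch below is a toy routing predicate with invented criteria (standard-universe jobs and jobs with heavy local I/O stay in the vanilla universe); in Condor the equivalent policy is expressed as ClassAd-based routing rules, not Python.

```python
# Toy routing predicate: decide per job whether it may be transformed into a
# grid-universe job or must stay vanilla. The criteria are invented examples.
def routable_to_grid(job: dict) -> bool:
    """Conservative check: only self-contained vanilla jobs leave the campus."""
    if job.get("universe") != "vanilla":
        return False                # e.g. standard universe needs checkpointing support
    if job.get("input_mb", 0) > 1000:
        return False                # too much data to ship to a remote site
    if job.get("needs_shared_fs", False):
        return False                # depends on a local file system
    return True

if __name__ == "__main__":
    jobs = [
        {"id": "Job 2", "universe": "vanilla", "input_mb": 50},
        {"id": "Job 3", "universe": "standard"},
        {"id": "Job 5", "universe": "vanilla", "needs_shared_fs": True},
    ]
    for j in jobs:
        print(j["id"], "->", "grid" if routable_to_grid(j) else "stay vanilla")
```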

Final Observation

A production grid is the product of a complex interplay of many forces:
- Resource providers
- Users
- Software providers
- Hardware trends
- Commercial offerings
- Funding agencies
- The culture of all parties involved
- …