Open Science Grid Middleware


Open Science Grid: An Introduction
Ruth Pordes
Fermilab
OSG Provenance
[Timeline diagram, 1999-2009: PPDG (DOE), GriPhyN (NSF) and iVDGL (NSF) join as Trillium, deploy Grid3, and evolve into OSG (DOE+NSF).]
Introducing myself
at Fermilab for 25 years (well, plus 2 years in the “pioneer” ‘70s),
started on data acquisition for High Energy Physics experiments,
a “builder” of the Sloan Digital Sky Survey,
led development of a common data acquisition system for 6 experiments at Fermilab (DART),
coordinator of the CDF/D0 Joint Run II offline projects (with Dane),
coordinator of the Particle Physics Data Grid SciDAC I collaboratory,
a founder of the Trillium collaboration of iVDGL, GriPhyN and PPDG, and of GLUE interoperability between the US and EU.
Now I am variously:
Executive Director of the Open Science Grid,
an Associate Head of the Computing Division at Fermilab, and
US CMS Grid Services and Interfaces Coordinator.
A Common Grid Infrastructure
Overlaid by community computational environments, ranging from single researchers to large groups, located locally to worldwide.
Grid of Grids - from Local to Global
National
Campus
Community
Current OSG deployment
96 resources across the production & integration infrastructures, using production & research networks.
Jobs running on OSG over 9 months, sustained through OSG submissions:
3,000-4,000 simultaneous jobs,
~10K jobs/day,
~50K CPU-hours/day,
peak of ~15K short validation jobs.
20 Virtual Organizations plus 6 operations VOs; includes 25% non-physics.
~20,000 CPUs (sites of 30 to 4,000 CPUs, shared between OSG and local use).
~6 PB of tape, ~4 PB of shared disk.
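A quick back-of-envelope check of these figures (my arithmetic, not from the slide): ~50K CPU-hours/day over ~10K jobs/day is about 5 CPU-hours per job, and 50K CPU-hours/day corresponds to roughly 2,000 CPUs busy around the clock, in line with 3,000-4,000 simultaneous jobs at peak.

    # Back-of-envelope check of the utilization figures quoted above
    # (illustrative arithmetic only; inputs are the slide's approximate values).
    jobs_per_day = 10_000        # ~10K jobs/day
    cpu_hours_per_day = 50_000   # ~50K CPU-hours/day

    avg_hours_per_job = cpu_hours_per_day / jobs_per_day   # ~5 CPU-hours per job
    avg_cpus_busy = cpu_hours_per_day / 24                  # ~2,083 CPUs busy on average

    print(f"average job length: {avg_hours_per_job:.1f} CPU-hours")
    print(f"average CPUs busy : {avg_cpus_busy:.0f} (vs. 3,000-4,000 simultaneous jobs at peak)")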
Examples of Sharing
[Table: ATLAS usage by site over the last week (columns: Site, Max # Jobs, CPUHours); sites listed: ASGC_OSG, BU_ATLAS_Tier2, CIT_CMS_T2, FIU-PG, FNAL_GPFARM, OSG_LIGO_PSU, OU_OCHEP_SWT2, Purdue-ITaP, UC_ATLAS_MWT2, UFlorida-IHEPA, UFlorida-PG (CMS), UMATLAS, UWMadisonCMS, UWMilwaukee, osg-gw-2.t2.ucsd.edu.]
[Table: VOs at the UCSD CMS site, osg-gw-2.t2.ucsd.edu (~300 batch slots), over the last week (columns: VO, Average # of Jobs, Max # Jobs, #Jobs Completed, CPUHours); VOs listed: ATLAS, CDF, CMS, COMPBIOGRID, GADU, LIGO.]
OSG Consortium
[Diagram: the OSG Consortium comprises the Contributors and the Project.]
OSG Project
OSG & its goals
The project receives ~$6M/year for 5 years from DOE and NSF for effort to sustain and evolve the distributed facility, bring on board new communities and capabilities, and carry out education, outreach & training (EOT). Hardware resources are contributed by OSG Consortium members.
Goals:
Support data storage, distribution & computation for High Energy, Nuclear & Astro Physics collaborations, in particular delivering to the needs of LHC and LIGO science.
Engage and benefit other research & science of all scales by progressively supporting their applications.
Educate & train students, administrators & educators.
Provide a petascale distributed facility across the US with guaranteed & opportunistic access to shared compute & storage resources.
Interface, federate and collaborate with campus, regional, and other national & international grids, in particular with EGEE & TeraGrid.
Provide an integrated, robust software stack for the facility & applications, tested on a well-provisioned, at-scale validation facility.
Evolve the capabilities by deploying externally developed new technologies through joint projects with the development groups.
Middleware Stack and Deployment
OSG middleware is deployed on existing farms and storage systems.
OSG middleware interfaces to the existing installations of OS, utilities and batch systems.
VOs have VO-scoped environments in which they deploy applications (and other files), execute code and store data.
VOs are responsible for, and have control over, their end-to-end distributed system using the OSG infrastructure.
[Diagram: the end-to-end s/w stack and its deployment into production; the Integration Grid has ~15 sites.]
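To make the VO-scoped environment concrete, here is a minimal sketch of a job wrapper, assuming (my assumption, not stated on this slide) that the site advertises conventional OSG locations such as $OSG_APP for VO-installed software and $OSG_DATA for shared data; the VO name and executable are hypothetical.

    # Hypothetical job wrapper run on an OSG worker node.
    # Assumes the site exports OSG_APP (VO-installed software area) and
    # OSG_DATA (shared data area); directory names and the "analyze"
    # executable are illustrative only.
    import os
    import subprocess

    vo = "myvo"  # hypothetical VO name
    app_dir = os.path.join(os.environ.get("OSG_APP", "/tmp"), vo)
    data_dir = os.path.join(os.environ.get("OSG_DATA", "/tmp"), vo)

    # Run the VO's pre-installed application against data staged into the VO area.
    subprocess.run(
        [os.path.join(app_dir, "bin", "analyze"),
         os.path.join(data_dir, "input.dat")],
        check=True,
    )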
OSG will support Global Data Transfer, Storage & Access at GBytes/sec, 365 days a year, e.g. CMS:
Data to/from tape at the Tier-1: 600 MB/sec; needs to triple in ~1 year.
Data to disk caches (data samples): 200 MB/sec.
[Plots: transfer rates among CERN, the ~7 Tier-1s and the Tier-2s, and data distributed to Tier-2 sites.]
OSG must enable data placement, disk usage and resource management policies for 10s of Gbit/sec data movement, 10s of petabytes of tape stores, and local shared disk caches of 100s of TBs, across 10s of sites for >10 VOs.
Data distribution will depend on & integrate with advanced network infrastructures:
Internet2 will provide "layer 2" connectivity between OSG university sites and peers in Europe.
ESnet will provide "layer 2" connectivity between OSG DOE laboratory sites and the EU GEANT network.
Both include use of the IRNC link (NSF) from the US to Amsterdam.
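To put "GBytes/sec, 365 days a year" in perspective, a rough conversion (my arithmetic, not from the slide): 1 GB/sec sustained around the clock is about 86 TB/day, or roughly 31 PB/year, which is the scale that fills 10s-of-petabyte tape stores.

    # Rough conversion from a sustained transfer rate to yearly data volume.
    # Illustrative arithmetic only; 200 MB/s and 600 MB/s are the rates quoted above.
    SECONDS_PER_DAY = 86_400
    DAYS_PER_YEAR = 365

    def petabytes_per_year(rate_mb_per_s: float) -> float:
        """Data volume moved in one year at a constant rate given in MB/s."""
        mb_per_year = rate_mb_per_s * SECONDS_PER_DAY * DAYS_PER_YEAR
        return mb_per_year / 1_000_000_000  # MB -> PB (decimal units)

    for rate in (200, 600, 1000):
        print(f"{rate:5d} MB/s sustained ~ {petabytes_per_year(rate):5.1f} PB/year")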
Security Infrastructure
• Identity: X509 certificates. Authentication and authorization using VOMS extended attribute certificates.
• Security process modelled on NIST procedural controls (management, operational, technical), starting from an inventory of the OSG assets.
• User and VO management:
VO registers with the Operations Center.
User registers through VOMRS or the VO administrator.
Site registers with the Operations Center.
Each VO centrally defines and assigns roles.
Each site provides role-to-access mappings based on VO/VOGroup, and can reject individuals.
• Heterogeneous identity management systems: OSG vs. TeraGrid/EGEE, grid vs. local, compute vs. storage, head-node vs. …, old-version vs. new-version. Issues include:
Cross-domain rights management.
Rights/identity management of software modules and resources.
Error/rejection propagation.
Solutions/approaches that work end-to-end.
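As a minimal sketch of the site-side mapping step described above (the attribute values, DNs and account names are hypothetical; in practice OSG sites did this with a grid-mapfile or an authorization service such as GUMS):

    # Minimal sketch of a site-side authorization decision: map a VOMS
    # (VO, group, role) attribute to a local account, honoring a per-user
    # reject list.  All names below are hypothetical.
    from typing import Optional

    # Site policy: (VO, group, role) -> local Unix account.
    ROLE_MAP = {
        ("cms", "/cms", "production"): "cmsprod",
        ("cms", "/cms", None): "cmsuser",
        ("ligo", "/ligo", None): "ligo",
    }

    # Individual certificate DNs the site has chosen to reject.
    BANNED_DNS = {"/DC=org/DC=example/CN=Revoked User"}

    def map_user(dn: str, vo: str, group: str, role: Optional[str]) -> Optional[str]:
        """Return the local account for this credential, or None to reject."""
        if dn in BANNED_DNS:
            return None
        # Prefer an exact (VO, group, role) match, then the VO/group default.
        return ROLE_MAP.get((vo, group, role)) or ROLE_MAP.get((vo, group, None))

    print(map_user("/DC=org/DC=example/CN=Jane Doe", "cms", "/cms", "production"))  # cmsprod
    print(map_user("/DC=org/DC=example/CN=Revoked User", "cms", "/cms", None))      # None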
Education, Outreach, Training
Training Workshops for Administrators and Application Developers, e.g. the Grid Summer Workshop (in its 4th year).
Outreach, e.g. Science Grid This Week -> International Science Grid This Week.
Education through e-Labs.
OSG Initial Timeline & Milestones - Summary
[Timeline chart spanning 2006 (project start) to 2011, with End of Phase I in 2008 and End of Phase II in 2010. Milestone tracks:]
Contribute to the Worldwide LHC Computing Grid: LHC simulations, then LHC event data distribution and analysis; support 1,000 users and a 20 PB data archive.
Contribute to LIGO workflow and data analysis: LIGO data run S5; LIGO Data Grid dependent on OSG; Advanced LIGO.
STAR, CDF, D0, Astrophysics: CDF simulation, then CDF simulation and analysis; D0 simulations and D0 reprocessing; STAR data distribution and jobs; 10K jobs per day.
Additional science communities: roughly +1 community per year through 2011.
Facility security (risk assessment, audits, incident response, management, operations, technical controls): security plan V1 and 1st audit, then a risk assessment and audit each year.
Facility operations and metrics: increase robustness and scale; operational metrics defined and validated each year.
Interoperate and federate with campus and regional grids.
VDT and OSG software releases: major release every 6 months, minor updates as needed; VDT 1.4.0, 1.4.1, 1.4.2, ... plus incremental VDT updates; OSG 0.6.0, 0.8.0, 1.0, 2.0, 3.0.
Capability milestones: dCache with accounting and auditing; federated monitoring and role-based information services; VDS with SRM authorization; common s/w distribution with TeraGrid; transparent data and job movement with TeraGrid; EGEE using VDT 1.4.X; transparent data management with EGEE.
Extended capabilities & increased scalability and performance for jobs and data to meet stakeholder needs: SRM/dCache extensions; "just in time" workload management; VO services infrastructure; integrated network management; data analysis (batch and interactive) workflow; improved workflow and resource selection.
Work with SciDAC-2 CEDS and Security with Open Science.