Open Science Grid
Linking Universities and Laboratories In National Cyberinfrastructure
www.opensciencegrid.org

Physics Colloquium, RIT (Rochester, NY), May 23, 2007
Paul Avery, University of Florida
[email protected]
Cyberinfrastructure and Grids
• Grid: geographically distributed computing resources configured for coordinated use
• Fabric: physical resources & networks providing raw capability
• Ownership: resources controlled by owners and shared with others
• Middleware: software tying it all together: tools, services, etc.
• Enhancing collaboration via transparent resource sharing (illustrated by the US-CMS “Virtual Organization”)
Motivation: Data Intensive Science
• 21st century scientific discovery
  - Computationally & data intensive
  - Theory + experiment + simulation
  - Internationally distributed resources and collaborations
• Dominant factor: data growth (1 petabyte = 1000 terabytes; implied growth rate worked out below)
  - 2000: ~0.5 petabyte
  - 2007: ~10 petabytes
  - 2013: ~100 petabytes
  - 2020: ~1000 petabytes
  How to collect, manage, access and interpret this quantity of data?
• Powerful cyberinfrastructure needed
  - Computation: massive, distributed CPU
  - Data storage & access: large-scale, distributed storage
  - Data movement: international optical networks
  - Data sharing: global collaborations (100s – 1000s)
  - Software: managing all of the above
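The implied compounding rate is easy to check; a minimal Python sketch, assuming simple exponential growth between the quoted years (the assumption is mine, not the slide's):

```python
# Rough check of the growth rate implied by the volumes above
# (assumption: simple exponential growth between the quoted years).
data_pb = {2000: 0.5, 2007: 10, 2013: 100, 2020: 1000}

years = sorted(data_pb)
for y0, y1 in zip(years, years[1:]):
    factor = data_pb[y1] / data_pb[y0]          # total growth over the interval
    annual = factor ** (1 / (y1 - y0)) - 1      # compound annual growth rate
    print(f"{y0} -> {y1}: x{factor:g} total, ~{annual:.0%}/year")
```

Each interval works out to roughly 40 – 55% growth per year, i.e. the data volume more than doubles every two years.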
Open Science Grid: July 20, 2005
• Consortium of many organizations (multiple disciplines)
• Production grid cyberinfrastructure
• 80+ sites, 25,000+ CPUs: US, UK, Brazil, Taiwan
The Open Science Grid Consortium
[Diagram: contributors surrounding the Open Science Grid hub]
• U.S. grid projects
• University facilities
• Multi-disciplinary facilities
• Science projects & communities
• LHC experiments
• Regional and campus grids
• Education communities
• Computer Science
• Laboratory centers
• Technologists (network, HPC, …)
Open Science Grid Basics
• Who
  - Computer scientists, IT specialists, physicists, biologists, etc.
• What
  - Shared computing and storage resources
  - High-speed production and research networks
  - Meeting place for research groups, software experts, IT providers
• Vision
  - Maintain and operate a premier distributed computing facility
  - Provide education and training opportunities in its use
  - Expand reach & capacity to meet needs of stakeholders
  - Dynamically integrate new resources and applications
• Members and partners
  - Members: HPC facilities, campus, laboratory & regional grids
  - Partners: interoperation with TeraGrid, EGEE, NorduGrid, etc.
Crucial Ingredients in Building OSG
• Science “push”: ATLAS, CMS, LIGO, SDSS
  - 1999: foresaw overwhelming need for distributed cyberinfrastructure
• Early funding: “Trillium” consortium
  - PPDG: $12M (DOE) (1999 – 2006)
  - GriPhyN: $12M (NSF) (2000 – 2006)
  - iVDGL: $14M (NSF) (2001 – 2007)
  - Supplements + new funded projects
• Social networks: ~150 people with many overlaps
  - Universities, labs, SDSC, foreign partners
• Coordination: pooling resources, developing broad goals
  - Common middleware: Virtual Data Toolkit (VDT)
  - Multiple grid deployments/testbeds using VDT
  - Unified entity when collaborating internationally
  - Historically, a strong driver for funding agency collaboration
OSG History in Context
[Timeline, 1999 – 2009: PPDG (DOE), GriPhyN (NSF) and iVDGL (NSF) converge through Trillium/Grid3 into OSG (DOE+NSF); LIGO preparation leads into LIGO operation; LHC construction and preparation lead into LHC operations; in parallel with the European Grid + Worldwide LHC Computing Grid and campus/regional grids]
Principal Science Drivers
• High energy and nuclear physics
  - 100s of petabytes (LHC)
  - Several petabytes
• LIGO (gravity wave search)
  - 0.5 – several petabytes
• Digital astronomy
  - 10s of petabytes
  - 10s of terabytes
• Other sciences coming forward
  - Bioinformatics (10s of petabytes)
  - Nanoscience
  - Environmental
  - Chemistry
  - Applied mathematics
  - Materials Science?
[The original slide also charts data growth and community growth for each driver over roughly 2001 – 2009]
OSG Virtual Organizations

VO         Type                  Description
ATLAS      HEP/LHC               HEP experiment at CERN
CDF        HEP                   HEP experiment at FermiLab
CMS        HEP/LHC               HEP experiment at CERN
DES        Digital astronomy     Dark Energy Survey
DOSAR      Regional grid         Regional grid in Southwest US
DZero      HEP                   HEP experiment at FermiLab
ENGAGE     Engagement effort     A place for new communities
FermiLab   Lab grid              HEP laboratory grid
fMRI       fMRI                  Functional MRI
GADU       Bio                   Bioinformatics effort at Argonne
Geant4     Software              Simulation project
GLOW       Campus grid           Campus grid, U of Wisconsin, Madison
GRASE      Regional grid         Regional grid in Upstate NY
OSG Virtual Organizations (2)

VO         Type                  Description
GridChem   Chemistry             Quantum chemistry grid
GPN        Great Plains Network  www.greatplains.net
GROW       Campus grid           Campus grid at U of Iowa
I2U2       EOT                   E/O consortium
LIGO       Gravity waves         Gravitational wave experiment
Mariachi   Cosmic rays           Ultra-high energy cosmic rays
nanoHUB    Nanotech              Nanotechnology grid at Purdue
NWICG      Regional grid         Northwest Indiana regional grid
NYSGRID    NY State Grid         www.nysgrid.org
OSGEDU     EOT                   OSG education/outreach
SBGRID     Structural biology    Structural biology @ Harvard
SDSS       Digital astronomy     Sloan Digital Sky Survey (Astro)
STAR       Nuclear physics       Nuclear physics experiment at Brookhaven
UFGrid     Campus grid           Campus grid at U of Florida
Partners: Federating with OSG
• Campus and regional
  - Grid Laboratory of Wisconsin (GLOW)
  - Grid Operations Center at Indiana University (GOC)
  - Grid Research and Education Group at Iowa (GROW)
  - Northwest Indiana Computational Grid (NWICG)
  - New York State Grid (NYSGrid) (in progress)
  - Texas Internet Grid for Research and Education (TIGRE)
  - nanoHUB (Purdue)
  - LONI (Louisiana)
• National
  - Data Intensive Science University Network (DISUN)
  - TeraGrid
• International
  - Worldwide LHC Computing Grid Collaboration (WLCG)
  - Enabling Grids for E-SciencE (EGEE)
  - TWGrid (from Academia Sinica Grid Computing)
  - Nordic Data Grid Facility (NorduGrid)
  - Australian Partnerships for Advanced Computing (APAC)
Defining the Scale of OSG: Experiments at the Large Hadron Collider
• LHC @ CERN: 27 km tunnel in Switzerland & France
• Experiments: ATLAS, CMS, ALICE, LHCb, TOTEM
• Search for:
  - Origin of mass
  - New fundamental forces
  - Supersymmetry
  - Other new particles
• 2007?
CMS: “Compact” Muon Solenoid
[Detector cross-section; “inconsequential humans” shown for scale]
Collision Complexity: CPU + Storage
[Event display: one hard collision plus ~30 minimum-bias events]
• All charged tracks with pT > 2 GeV
• Reconstructed tracks with pT > 25 GeV
• 10^9 collisions/sec; selectivity: 1 in 10^13 (worked out below)
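A back-of-the-envelope reading of those two numbers, assuming ~10^7 seconds of live beam per year (my assumption, not stated on the slide):

```python
# Why the selectivity figure matters: how many selected signal events per year?
# (assumption: ~1e7 seconds of live beam per year)
collisions_per_sec = 1e9
selectivity = 1e-13            # "1 in 10^13"
seconds_per_year = 1e7

signal_events = collisions_per_sec * selectivity * seconds_per_year
print(f"~{signal_events:.0f} selected signal events per year")   # ~1000
```

Roughly a thousand events of interest must be dug out of about 10^16 collisions per year, which is why trigger selection and massive reconstruction CPU dominate the computing problem.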
LHC Data and CPU Requirements
• Storage
  - Raw recording rate 0.2 – 1.5 GB/s (rough yearly totals checked below)
  - Large Monte Carlo data samples
  - 100 PB by ~2013
  - 1000 PB later in the decade?
• Processing
  - PetaOps (> 300,000 3 GHz PCs)
• Users
  - 100s of institutes
  - 1000s of researchers
[Detector images: CMS, ATLAS, LHCb]
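A quick consistency check on the storage and processing figures, again assuming ~10^7 s of data taking per year and roughly one operation per clock cycle (both assumptions mine):

```python
# Consistency check on the quoted raw rates and CPU count
# (assumptions: ~1e7 s of data taking per year; ~1 op per clock cycle).
seconds_per_year = 1e7

for rate_gb_s in (0.2, 1.5):                          # quoted raw recording rates
    pb_per_year = rate_gb_s * seconds_per_year / 1e6  # 1 PB = 1e6 GB
    print(f"{rate_gb_s} GB/s -> ~{pb_per_year:.0f} PB/year raw")

pcs, clock_hz = 300_000, 3e9
print(f"{pcs:,} x 3 GHz PCs ~ {pcs * clock_hz / 1e15:.1f} PetaOps")
```

The raw rate alone gives a few to ~15 PB per year per experiment, consistent with the 100 PB by ~2013 figure once simulation and derived data are added.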
OSG and LHC Global Grid
• 5000 physicists, 60 countries
• 10s of petabytes/yr by 2009
• CERN / Outside = 10-20%
[CMS experiment data grid hierarchy]
• Online System → Tier 0 (CERN Computer Center): 200 – 1500 MB/s
• Tier 0 → Tier 1 centers (FermiLab, Korea, Russia, UK): 10-40 Gb/s
• Tier 1 → Tier 2 centers on OSG (U Florida, Caltech, UCSD): >10 Gb/s
• Tier 2 → Tier 3 (FIU, Iowa, Maryland; physics caches): 2.5-10 Gb/s (per-PB transfer times sketched below)
• Tier 4: PCs
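For a feel for these link speeds, the time to move one petabyte at each quoted bandwidth, assuming a fully saturated link with no protocol overhead:

```python
# How long one petabyte takes at the link speeds in the tier diagram
# (assumption: fully saturated link, no protocol overhead).
PB_IN_BITS = 8e15

for gbps in (2.5, 10, 40):
    seconds = PB_IN_BITS / (gbps * 1e9)
    print(f"{gbps:>4} Gb/s: ~{seconds / 86400:.1f} days per PB")
```

Even at 10 Gb/s a petabyte takes on the order of nine days, which is why the tiered distribution model and dedicated optical networks are central to the design.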
LHC Global Collaborations
[Collaboration maps: ATLAS and CMS]
• 2000 – 3000 physicists per experiment
• USA is 20–31% of total
LIGO: Search for Gravity Waves
• LIGO Grid: 6 US sites + 3 EU sites (UK & Germany: Birmingham, Cardiff, AEI/Golm)
  - LHO, LLO: LIGO observatory sites
  - LSC: LIGO Scientific Collaboration
Sloan Digital Sky Survey: Mapping the Sky
Bioinformatics: GADU / GNARE
GNARE – Genome Analysis Research Environment
• Public databases: genomic databases available on the web, e.g. NCBI, PIR, KEGG, EMP, InterPro, etc.
• GADU using the Grid (TeraGrid, DOE SG, OSG): applications executed on the Grid as workflows; results stored in the integrated database (bidirectional data flow)
• Services to other groups: SEED (data acquisition), Shewanella Consortium (genome analysis), others
• GADU performs (sketched in code below):
  - Acquisition: acquire genome data from a variety of publicly available databases and store it temporarily on the file system
  - Analysis: run different publicly available tools and in-house tools on the Grid using the acquired data & data from the integrated database
  - Storage: store the parsed data acquired from public databases and the parsed results of the tools and workflows used during analysis
• Integrated database includes:
  - Parsed sequence data and annotation data from public web sources
  - Results of different tools used for analysis: BLAST, Blocks, TMHMM, …
• Applications (web interfaces) based on the integrated database:
  - Chisel: protein function analysis tool
  - PATHOS: pathogenic DB for bio-defense research
  - PUMA2: evolutionary analysis of metabolism
  - TARGET: targets for structural analysis of proteins
  - Phyloblocks: evolutionary analysis of protein families
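A hypothetical sketch of the acquire → analyze → store pattern described above; the function names, the BLAST invocation, and the table schema are illustrative stand-ins, not actual GADU/GNARE code:

```python
# Illustrative only: a toy acquire -> analyze -> store pipeline in the spirit
# of the GADU workflow. Paths, tools and schema are made up for the example.
import sqlite3
import subprocess
import tempfile
import urllib.request


def acquire(url: str) -> str:
    """Fetch a public sequence file and stage it temporarily on local disk."""
    path = tempfile.mktemp(suffix=".fasta")
    urllib.request.urlretrieve(url, path)
    return path


def analyze(fasta_path: str, db: str) -> str:
    """Run one analysis tool (here a BLAST search) the way a grid job would."""
    out_path = fasta_path + ".blast"
    subprocess.run(["blastp", "-query", fasta_path, "-db", db, "-out", out_path],
                   check=True)
    return out_path


def store(result_path: str, conn: sqlite3.Connection) -> None:
    """Parse the tool output and load it into the integrated database."""
    with open(result_path) as fh:
        conn.execute("INSERT INTO results (raw_output) VALUES (?)", (fh.read(),))
    conn.commit()
```

In the real system each of these steps would run as a grid workflow job, fanned out over many sequences and tools, with the results consolidated in the integrated database.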
Bioinformatics (cont.)
[Figure: Shewanella oneidensis genome]
Nanoscience Simulations
• nanoHUB.org: online simulation, courses, tutorials, seminars, learning modules, collaboration
• Real users and real usage: >10,100 users; 1881 simulation users; >53,000 simulations
OSG Engagement Effort
• Purpose: bring non-physics applications to OSG
• Led by RENCI (UNC + NC State + Duke)
• Specific targeted opportunities
  - Develop relationship
  - Direct assistance with technical details of connecting to OSG
• Feedback and new requirements for OSG infrastructure (to facilitate inclusion of new communities)
  - More & better documentation
  - More automation
OSG and the Virtual Data Toolkit
• VDT: a collection of software
  - Grid software (Condor, Globus, VOMS, dCache, GUMS, Gratia, …) (a toy job submission example follows below)
  - Virtual Data System
  - Utilities
• VDT: the basis for the OSG software stack
  - Goal is easy installation with automatic configuration
  - Now widely used in other projects
  - Has a growing support infrastructure
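Because Condor sits at the heart of this stack, here is what submitting a single job looks like with the modern htcondor Python bindings; these bindings postdate the talk, and the executable and file names are placeholders, so this is only to make the idea concrete:

```python
# Illustrative job submission with the htcondor Python bindings
# (modern API, not part of the 2007-era VDT; names are placeholders).
import htcondor

job = htcondor.Submit({
    "executable": "/bin/echo",
    "arguments": "hello from the grid",
    "output": "hello.out",
    "error": "hello.err",
    "log": "hello.log",
})

schedd = htcondor.Schedd()    # talk to the local Condor scheduler
result = schedd.submit(job)   # queue one job
print("submitted cluster", result.cluster())
```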
Why Have the VDT?
• Everyone could download the software from the providers
• But the VDT:
  - Figures out dependencies between software (illustrated below)
  - Works with providers for bug fixes
  - Automatically configures & packages software
  - Tests everything on 15 platforms (and growing):
    - Debian 3.1
    - Fedora Core 3
    - Fedora Core 4 (x86, x86-64)
    - RedHat Enterprise Linux 3 AS (x86, x86-64, ia64)
    - RedHat Enterprise Linux 4 AS (x86, x86-64)
    - ROCKS Linux 3.3
    - Scientific Linux Fermi 3
    - Scientific Linux Fermi 4 (x86, x86-64, ia64)
    - SUSE Linux 9 (IA-64)
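A minimal sketch of what "figuring out dependencies" amounts to: a topological sort over a dependency graph. The package names and edges below are invented for illustration, not the VDT's real metadata:

```python
# Toy dependency resolution: install order = topological order of the graph.
# The packages and edges are illustrative, not real VDT metadata.
from graphlib import TopologicalSorter   # Python 3.9+

deps = {
    "osg-client": {"globus", "condor", "voms"},
    "globus":     {"openssl"},
    "voms":       {"openssl"},
    "condor":     set(),
    "openssl":    set(),
}

install_order = list(TopologicalSorter(deps).static_order())
print(install_order)   # dependencies always appear before their dependents
```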
VDT Growth Over 5 Years (1.6.1 now)
[Chart: number of major VDT components vs. time, Jan 2002 – Jan 2007 (y-axis 0 – 50); software was both added and removed along the way, with more dev releases toward the end; vdt.cs.wisc.edu]
• VDT 1.0 (Globus 2.0b, Condor-G 6.3.1)
• VDT 1.1.8: adopted by LCG
• VDT 1.1.11: Grid2003
• VDT 1.3.6: for OSG 0.2
• VDT 1.3.9: for OSG 0.4
• VDT 1.6.1: for OSG 0.6.0
Collaboration with Internet2
www.internet2.edu
Collaboration with National Lambda Rail
www.nlr.net
• Optical, multi-wavelength, community owned or leased “dark fiber” (10 GbE) networks for R&E
• Spawning state-wide and regional networks (FLR, SURA, LONI, …)
• Bulletin: NLR-Internet2 merger announcement
UltraLight: Integrating Advanced Networking in Applications
http://www.ultralight.org (funded by NSF)
• 10 Gb/s+ network
  - Caltech, UF, FIU, UM, MIT
  - SLAC, FNAL
  - Int'l partners
  - Level(3), Cisco, NLR
REDDnet: National Networked Storage
• NSF funded project (Vanderbilt)
• 8 initial sites (Brazil?)
• Multiple disciplines
  - Satellite imagery
  - HEP
  - Terascale Supernova Initiative
  - Structural biology
  - Bioinformatics
• Storage
  - 500 TB disk
  - 200 TB tape
OSG Jobs Snapshot: 6 Months
[Chart, Sep – Mar: 5000 simultaneous jobs from multiple VOs]
OSG Jobs Per Site: 6 Months
[Chart, Sep – Mar: 5000 simultaneous jobs at multiple sites]
Completed Jobs/Week on OSG
[Chart, Sep – Mar: up to ~400K completed jobs per week, peaking during the CMS “Data Challenge”; a rough consistency check follows below]
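A rough consistency check tying this to the ~5000 simultaneous jobs on the previous slides; the assumption that those slots stay busy for the whole peak week is mine:

```python
# Implied mean job length if ~5000 slots stay busy for a full week
# and ~400K jobs complete in that week (assumptions, not measured values).
slots = 5000
jobs_per_week = 400_000
hours_per_week = 7 * 24

mean_job_hours = slots * hours_per_week / jobs_per_week
print(f"implied mean job length ~ {mean_job_hours:.1f} hours")   # ~2.1 h
```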
# Jobs Per VO
[Chart from the new accounting system (Gratia)]
Massive 2007 Data Reprocessing by D0 Experiment @ Fermilab
[Chart: events reprocessed via SAM on OSG and LCG; ~400M events total, ~250M on OSG]
CDF Discovery of Bs Oscillations
• $B_s \leftrightarrow \bar{B}_s$ oscillations, frequency $f = \Delta m_s / 2\pi \approx 2.8\ \mathrm{THz}$
• $|B_s(t)\rangle = e^{-t/2\tau}\left[\,\sin\!\left(x_s t/2\tau\right)|B_{s1}\rangle + \cos\!\left(x_s t/2\tau\right)|B_{s2}\rangle\,\right]$, with $x_s = \Delta m_s\,\tau$
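For context, the standard textbook mixing probabilities behind such a measurement, neglecting CP violation and the width difference (not taken from the slide):

```latex
% Standard B_s mixing probabilities (textbook form, not from the slide)
\begin{align}
  P\bigl(B_s \to B_s\bigr)(t)       &= \tfrac{1}{2}\, e^{-t/\tau}\,\bigl[1 + \cos(\Delta m_s\, t)\bigr],\\
  P\bigl(B_s \to \bar{B}_s\bigr)(t) &= \tfrac{1}{2}\, e^{-t/\tau}\,\bigl[1 - \cos(\Delta m_s\, t)\bigr].
\end{align}
```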
Communications: International Science Grid This Week
• SGTW → iSGTW
• From April 2005
• Diverse audience, >1000 subscribers
• www.isgtw.org
OSG News: Monthly Newsletter
• 18 issues by Apr. 2007
• www.opensciencegrid.org/osgnews
Grid Summer Schools
• Summer 2004, 2005, 2006
  - 1 week @ South Padre Island, Texas
  - Lectures plus hands-on exercises for ~40 students
  - Students of differing backgrounds (physics + CS), minorities
• Reaching a wider audience
  - Lectures, exercises, video, on web
  - More tutorials, 3-4/year
  - Students, postdocs, scientists
  - Agency specific tutorials
Project Challenges
• Technical constraints
  - Commercial tools fall far short, require (too much) invention
  - Integration of advanced CI, e.g. networks
• Financial constraints (slide)
  - Fragmented & short term funding injections (recent $30M/5 years)
  - Fragmentation of individual efforts
• Distributed coordination and management
  - Tighter organization within member projects compared to OSG
  - Coordination of schedules & milestones
  - Many phone/video meetings, travel
  - Knowledge dispersed, few people have broad overview
Funding & Milestones: 1999 – 2007
[Timeline legend: grid & networking projects; large experiments; education, outreach, training]
• Projects: PPDG, $9.5M (DOE); GriPhyN, $12M (NSF); iVDGL, $14M (NSF); UltraLight, $2M; CHEPREO, $4M; DISUN, $10M; OSG, $30M (NSF, DOE)
• Milestones: grid communications; first US-LHC grid testbeds; VDT 1.0; LIGO Grid; Grid3 start; VDT 1.3; OSG start (2005); LHC start (2007)
• Education & outreach: Grid Summer Schools 2004, 2005, 2006; Digital Divide Workshops 04, 05, 06
Challenges from Diversity and Growth
• Management of an increasingly diverse enterprise
  - Sci/Eng projects, organizations, disciplines as distinct cultures
  - Accommodating new member communities (expectations?)
• Interoperation with other grids
  - TeraGrid
  - International partners (EGEE, NorduGrid, etc.)
  - Multiple campus and regional grids
• Education, outreach and training
  - Training for researchers, students
  - … but also project PIs, program officers
• Operating a rapidly growing cyberinfrastructure
  - 25K → 100K CPUs, 4 → 10 PB disk
  - Management of and access to rapidly increasing data stores (slide)
  - Monitoring, accounting, achieving high utilization
  - Scalability of support model (slide)
Rapid Cyberinfrastructure Growth: LHC
• Meeting LHC service challenges & milestones
• Participating in worldwide simulation productions
[Chart: projected LHC computing capacity in MSI2000 by year, 2007 – 2010, broken down into CERN, Tier-1 and Tier-2 shares for ALICE, ATLAS, CMS and LHCb; the 2008 total corresponds to ~140,000 PCs]
OSG Operations
• Distributed model: VOs, sites, providers (scalability!)
• Rigorous problem tracking & routing
• Security
• Provisioning
• Monitoring
• Reporting
• Partners with EGEE operations
Five Year Project Timeline & Milestones
[Timeline, 2006 – 2011: project start (2006), end of Phase I, end of Phase II (2011)]
• LHC: contribute to the Worldwide LHC Computing Grid; LHC simulations; LHC event data distribution and analysis; support 1000 users; 20 PB data archive
• LIGO: contribute to LIGO workflow and data analysis; LIGO data run S5; Advanced LIGO; LIGO Data Grid dependent on OSG
• STAR, CDF, D0, astrophysics: CDF simulation, then CDF simulation and analysis; D0 simulations, then D0 reprocessing; STAR data distribution and jobs
• Additional science communities: 10K jobs per day; +1 community added repeatedly through 2006 – 2010
• Facility security: risk assessment, audits, incident response, management, operations, technical controls (plan V1, 1st audit, then annual risk assessments and audits)
• Facility operations and metrics: increase robustness and scale; operational metrics defined and validated each year
• Interoperate and federate with campus and regional grids
• VDT and OSG software releases: major release every 6 months, minor updates as needed (VDT 1.4.0, 1.4.1, 1.4.2, …; OSG 0.6.0, 0.8.0, 1.0, 2.0, 3.0; VDT incremental updates)
• Capability roadmap: dCache with role-based authorization; accounting and auditing; federated monitoring and information services; VDS with SRM; common s/w distribution with TeraGrid; transparent data and job movement with TeraGrid; EGEE using VDT 1.4.X; transparent data management with EGEE
• Extended capabilities & increased scalability and performance for jobs and data to meet stakeholder needs: SRM/dCache extensions; “just in time” workload management; VO services infrastructure; integrated network management; data analysis (batch and interactive) workflow; improved workflow and resource selection; work with SciDAC-2 CEDS and security with Open Science
Extra Slides
VDT Release Process (Subway Map)
[Diagram, from Alain Roy: Day 0 → Day N]
• Gather requirements
• Build software
• Test
• Validation test bed → VDT release
• Integration test bed (ITB release candidate) → OSG release
VDT Challenges
• How should we smoothly update a production service?
  - In-place vs. on-the-side
  - Preserve old configuration while making big changes
  - Still takes hours to fully install and set up from scratch
• How do we support more platforms?
  - A struggle to keep up with the onslaught of Linux distributions
  - AIX? Mac OS X? Solaris?
• How can we accommodate native packaging formats?
  - RPM
  - Deb
[Distribution logos: Fedora Core 3, Fedora Core 4, Fedora Core 6, RHEL 3, RHEL 4, BCCD]