Open Science Grid
Linking Universities and Laboratories In National Cyberinfrastructure
www.opensciencegrid.org
Physics Colloquium, RIT (Rochester, NY)
May 23, 2007
Paul Avery
University of Florida
[email protected]
Cyberinfrastructure and Grids
Grid: Geographically distributed computing resources configured for coordinated use
Fabric: Physical resources & networks providing raw capability
Ownership: Resources controlled by owners and shared with others
Middleware: Software tying it all together: tools, services, etc.
Enhancing collaboration via transparent resource sharing (e.g. the US-CMS "Virtual Organization")
Motivation: Data Intensive Science
21st century scientific discovery:
- Computationally & data intensive
- Theory + experiment + simulation
- Internationally distributed resources and collaborations
Dominant factor: data growth (1 petabyte = 1000 terabytes)
- 2000: ~0.5 petabyte
- 2007: ~10 petabytes
- 2013: ~100 petabytes
- 2020: ~1000 petabytes
How to collect, manage, access and interpret this quantity of data? Powerful cyberinfrastructure needed:
- Computation → Massive, distributed CPU
- Data storage & access → Large-scale, distributed storage
- Data movement → International optical networks
- Data sharing → Global collaborations (100s – 1000s)
- Software → Managing all of the above
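A quick consistency check on the growth figures above (an illustrative back-of-envelope calculation, not from the slides):

```latex
\frac{1000\,\mathrm{PB}\ (2020)}{0.5\,\mathrm{PB}\ (2000)} = 2000 \approx 2^{11}
\quad\Rightarrow\quad
\text{doubling time} \approx \frac{20\ \text{years}}{11} \approx 1.8\ \text{years}
```

So the projected data volumes double roughly every two years, which is why no single machine or site keeps pace and distributed cyberinfrastructure is needed.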
Open Science Grid: July 20, 2005
Consortium of many organizations (multiple disciplines)
Production grid cyberinfrastructure
80+ sites, 25,000+ CPUs: US, UK, Brazil, Taiwan
The Open Science Grid Consortium
[Diagram: the Open Science Grid at the center of U.S. grid projects, university facilities, multi-disciplinary facilities, science projects & communities, LHC experiments, regional and campus grids, education communities, computer science, laboratory centers, and technologists (network, HPC, …)]
Open Science Grid Basics
Who: computer scientists, IT specialists, physicists, biologists, etc.
What:
- Shared computing and storage resources
- High-speed production and research networks
- Meeting place for research groups, software experts, IT providers
Vision:
- Maintain and operate a premier distributed computing facility
- Provide education and training opportunities in its use
- Expand reach & capacity to meet needs of stakeholders
- Dynamically integrate new resources and applications
Members and partners:
- Members: HPC facilities, campus, laboratory & regional grids
- Partners: interoperation with TeraGrid, EGEE, NorduGrid, etc.
Crucial Ingredients in Building OSG
Science "Push": ATLAS, CMS, LIGO, SDSS
- 1999: foresaw overwhelming need for distributed cyberinfrastructure
Early funding: "Trillium" consortium
- PPDG: $12M (DOE) (1999 – 2006)
- GriPhyN: $12M (NSF) (2000 – 2006)
- iVDGL: $14M (NSF) (2001 – 2007)
- Supplements + new funded projects
Social networks: ~150 people with many overlaps
- Universities, labs, SDSC, foreign partners
Coordination: pooling resources, developing broad goals
Common middleware: Virtual Data Toolkit (VDT)
- Multiple Grid deployments/testbeds using VDT
Unified entity when collaborating internationally
- Historically, a strong driver for funding agency collaboration
OSG History in Context
[Timeline, 1999 – 2009:]
- PPDG (DOE) from 1999; GriPhyN (NSF) from 2000; iVDGL (NSF) from 2001
- Trillium → Grid3 → OSG (DOE+NSF) from 2005
- LIGO preparation → LIGO operation
- LHC construction, preparation → LHC Ops
- European Grid + Worldwide LHC Computing Grid
- Campus, regional grids
Principal Science Drivers
Data scale and community growth (experiments coming online ~2001 – 2009):
- LHC (high energy physics): 100s of petabytes
- LIGO (gravity wave search): several petabytes
- Digital astronomy: 0.5 – several petabytes
- High energy and nuclear physics
- Other sciences coming forward:
  - Bioinformatics (10s of petabytes)
  - Nanoscience (10s of terabytes)
  - Environmental
  - Chemistry
  - Applied mathematics
  - Materials science?
OSG Virtual Organizations
ATLAS | HEP/LHC | HEP experiment at CERN
CDF | HEP | HEP experiment at FermiLab
CMS | HEP/LHC | HEP experiment at CERN
DES | Digital astronomy | Dark Energy Survey
DOSAR | Regional grid | Regional grid in Southwest US
DZero | HEP | HEP experiment at FermiLab
ENGAGE | Engagement effort | A place for new communities
FermiLab | Lab grid | HEP laboratory grid
fMRI | fMRI | Functional MRI
GADU | Bio | Bioinformatics effort at Argonne
Geant4 | Software | Simulation project
GLOW | Campus grid | Campus grid at U of Wisconsin, Madison
GRASE | Regional grid | Regional grid in Upstate NY
OSG Virtual Organizations (2)
GridChem | Chemistry | Quantum chemistry grid
GPN | Regional grid | Great Plains Network (www.greatplains.net)
GROW | Campus grid | Campus grid at U of Iowa
I2U2 | EOT | E/O consortium
LIGO | Gravity waves | Gravitational wave experiment
Mariachi | Cosmic rays | Ultra-high energy cosmic rays
nanoHUB | Nanotech | Nanotechnology grid at Purdue
NWICG | Regional grid | Northwest Indiana regional grid
NYSGRID | Regional grid | NY State Grid (www.nysgrid.org)
OSGEDU | EOT | OSG education/outreach
SBGRID | Structural biology | Structural biology @ Harvard
SDSS | Digital astronomy | Sloan Digital Sky Survey
STAR | Nuclear physics | Nuclear physics experiment at Brookhaven
UFGrid | Campus grid | Campus grid at U of Florida
Partners: Federating with OSG
Campus and regional:
- Grid Laboratory of Wisconsin (GLOW)
- Grid Operations Center at Indiana University (GOC)
- Grid Research and Education Group at Iowa (GROW)
- Northwest Indiana Computational Grid (NWICG)
- New York State Grid (NYSGrid) (in progress)
- Texas Internet Grid for Research and Education (TIGRE)
- nanoHUB (Purdue)
- LONI (Louisiana)
National:
- TeraGrid
- Data Intensive Science University Network (DISUN)
International:
- Worldwide LHC Computing Grid Collaboration (WLCG)
- Enabling Grids for E-SciencE (EGEE)
- TWGrid (from Academia Sinica Grid Computing)
- Nordic Data Grid Facility (NorduGrid)
- Australian Partnerships for Advanced Computing (APAC)
Defining the Scale of OSG:
Experiments at Large Hadron Collider
27 km tunnel in Switzerland & France
LHC @ CERN (startup: 2007?)
Experiments: ATLAS, CMS, ALICE, LHCb, TOTEM
Search for:
- Origin of mass
- New fundamental forces
- Supersymmetry
- Other new particles
CMS: “Compact” Muon Solenoid
Inconsequential humans
Collision Complexity: CPU + Storage
(+30 minimum bias events)
All charged tracks with pt > 2 GeV
Reconstructed tracks with pt > 25 GeV
10^9 collisions/sec, selectivity: 1 in 10^13
LHC Data and CPU Requirements
Storage:
- Raw recording rate 0.2 – 1.5 GB/s
- Large Monte Carlo data samples
- 100 PB by ~2013; 1000 PB later in decade?
Processing:
- PetaOps (> 300,000 3 GHz PCs)
Users:
- 100s of institutes, 1000s of researchers
(ATLAS, CMS, LHCb)
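The storage numbers follow from the recording rate. Assuming roughly 10^7 seconds of live data-taking per year (a common rule of thumb, not a figure from the slides), the quoted 0.2 – 1.5 GB/s implies a few to ~15 PB of raw data per experiment per year, consistent with the archive targets above. A minimal sketch:

```python
# Back-of-envelope: sustained recording rate (GB/s) -> raw petabytes/year.
# The ~1e7 s/year of live beam time is an assumption, not from the talk.
SECONDS_PER_YEAR_LIVE = 1e7

def annual_petabytes(rate_gb_per_s: float,
                     live_seconds: float = SECONDS_PER_YEAR_LIVE) -> float:
    """Convert a sustained recording rate in GB/s into PB per year."""
    gigabytes = rate_gb_per_s * live_seconds
    return gigabytes / 1e6  # 1 PB = 1e6 GB (decimal units)

low, high = annual_petabytes(0.2), annual_petabytes(1.5)
print(f"{low:.0f} - {high:.0f} PB/year")  # 2 - 15 PB/year
```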
OSG and LHC Global Grid
5000 physicists, 60 countries
10s of petabytes/yr by 2009
CERN / Outside = 10-20%

CMS experiment tiered data flow:
- Online System → Tier 0 (CERN Computer Center) at 200 – 1500 MB/s
- Tier 0 → Tier 1 centers (FermiLab, Korea, Russia, UK) at 10-40 Gb/s
- Tier 1 → Tier 2 centers on OSG (U Florida, Caltech, UCSD, Iowa, Maryland) at >10 Gb/s
- Tier 2 → Tier 3 (e.g. FIU, physics caches) at 2.5-10 Gb/s
- Tier 3 → Tier 4 (PCs)
LHC Global Collaborations
CMS and ATLAS: 2000 – 3000 physicists per experiment
USA is 20 – 31% of total
LIGO: Search for Gravity Waves
LIGO Grid: 6 US sites + 3 EU sites (UK & Germany: Birmingham, Cardiff, AEI/Golm)
* LHO, LLO: LIGO observatory sites
* LSC: LIGO Scientific Collaboration
Sloan Digital Sky Survey: Mapping the Sky
Bioinformatics: GADU / GNARE
Public databases: genomic databases available on the web, e.g. NCBI, PIR, KEGG, EMP, InterPro, etc.

GADU performs:
- Acquisition: acquire genome data from a variety of publicly available databases and store temporarily on the file system.
- Analysis: run different publicly available tools and in-house tools on the Grid (OSG, TeraGrid, DOE SG; bidirectional data flow) using acquired data & data from the integrated database. Applications are executed on the Grid as workflows, and results are stored in the integrated database.
- Storage: store the parsed data acquired from public databases and the parsed results of the tools and workflows used during analysis.

Integrated database includes:
- Parsed sequence data and annotation data from public web sources
- Results of different tools used for analysis: BLAST, Blocks, TMHMM, …

Applications (web interfaces) based on the integrated database:
- Chisel: protein function analysis tool
- PATHOS: pathogenic DB for bio-defense research
- PUMA2: evolutionary analysis of metabolism
- TARGET: targets for structural analysis of proteins
- Phyloblocks: evolutionary analysis of protein families

Services to other groups: SEED (data acquisition), Shewanella Consortium (genome analysis), others.

GNARE: Genome Analysis Research Environment
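The GADU acquire → analyze → store cycle can be sketched as a tiny pipeline. Everything here is an illustration only: the function names, record format, and the fake sequence are hypothetical stand-ins, not the actual GADU/GNARE interfaces.

```python
# Toy sketch of an acquire -> analyze -> store workflow in the GADU style.
# All names and data below are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class Sequence:
    accession: str
    residues: str

def acquire(source: str) -> list[Sequence]:
    """Stage raw records from a public database onto local storage."""
    # Real GADU pulls from NCBI, PIR, KEGG, etc.; here we fabricate one record.
    return [Sequence(accession=f"{source}:0001", residues="MKTAYIAKQR")]

def analyze(seq: Sequence) -> dict:
    """Stand-in for a grid-dispatched analysis tool such as BLAST or TMHMM."""
    return {"accession": seq.accession, "length": len(seq.residues)}

def store(results: list[dict]) -> dict[str, dict]:
    """Index parsed results the way the integrated database would."""
    return {r["accession"]: r for r in results}

db = store([analyze(s) for s in acquire("NCBI")])
print(db["NCBI:0001"]["length"])  # 10
```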
Bioinformatics (cont)
Shewanella oneidensis genome
Nanoscience Simulations
nanoHUB.org: online simulation, courses, tutorials, seminars, learning modules, collaboration
Real users and real usage: >10,100 users; 1881 simulation users; >53,000 simulations
OSG Engagement Effort
Purpose: bring non-physics applications to OSG
Led by RENCI (UNC + NC State + Duke)
Specific targeted opportunities:
- Develop relationship
- Direct assistance with technical details of connecting to OSG
Feedback and new requirements for OSG infrastructure (to facilitate inclusion of new communities):
- More & better documentation
- More automation
OSG and the Virtual Data Toolkit
VDT: a collection of software
- Grid software (Condor, Globus, VOMS, dCache, GUMS, Gratia, …)
- Virtual Data System
- Utilities
VDT: the basis for the OSG software stack
- Goal is easy installation with automatic configuration
- Now widely used in other projects
- Has a growing support infrastructure
Why Have the VDT?
Everyone could download the software from the providers, but the VDT:
- Figures out dependencies between software
- Works with providers for bug fixes
- Automatically configures & packages software
- Tests everything on 15 platforms (and growing):
  Debian 3.1; Fedora Core 3; Fedora Core 4 (x86, x86-64); RedHat Enterprise Linux 3 AS (x86, x86-64, ia64); RedHat Enterprise Linux 4 AS (x86, x86-64); ROCKS Linux 3.3; Scientific Linux Fermi 3; Scientific Linux Fermi 4 (x86, x86-64, ia64); SUSE Linux 9 (IA-64)
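The first of those jobs, figuring out dependencies, is at heart a topological sort: install each package only after everything it depends on. A toy illustration using Python's standard library (the package names and edges are made up, and the VDT's real resolver is far more involved):

```python
# Illustrative only: order packages so dependencies precede dependents,
# the core problem a meta-packager like the VDT solves.
from graphlib import TopologicalSorter

# edges: package -> set of packages it depends on (hypothetical names)
deps = {
    "condor-g": {"globus"},
    "globus": set(),
    "voms": {"globus"},
    "osg-stack": {"condor-g", "voms"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # "globus" comes before everything that needs it
```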
VDT Growth Over 5 Years (1.6.1 now)
[Chart: number of major VDT components per release, Jan 2002 – Jan 2007, growing from a handful to ~45; software was both added and removed along the way, with more dev releases over time. vdt.cs.wisc.edu]
Milestones:
- VDT 1.0 (2002): Globus 2.0b, Condor-G 6.3.1
- VDT 1.1.8: adopted by LCG
- VDT 1.1.11: Grid2003
- VDT 1.3.6: for OSG 0.2
- VDT 1.3.9: for OSG 0.4
- VDT 1.6.1: for OSG 0.6.0
Collaboration with Internet2
www.internet2.edu
Collaboration with National Lambda Rail
www.nlr.net
Optical, multi-wavelength community owned or leased “dark fiber”
(10 GbE) networks for R&E
Spawning state-wide and regional networks (FLR, SURA, LONI, …)
Bulletin: NLR-Internet2 merger announcement
UltraLight
Integrating Advanced Networking in Applications
http://www.ultralight.org
Funded by NSF
10 Gb/s+ network:
- Caltech, UF, FIU, UM, MIT
- SLAC, FNAL
- Int'l partners
- Level(3), Cisco, NLR
REDDnet: National Networked Storage
NSF funded project (Vanderbilt)
8 initial sites
Multiple disciplines:
- Satellite imagery
- HEP
- Terascale Supernova Initiative
- Structural biology
- Bioinformatics
Storage: 500 TB disk, 200 TB tape
Brazil?
OSG Jobs Snapshot: 6 Months
5000 simultaneous jobs from multiple VOs
[Plot: Sep – Mar]
OSG Jobs Per Site: 6 Months
5000 simultaneous jobs at multiple sites
[Plot: Sep – Mar]
Completed Jobs/Week on OSG
Up to ~400K completed jobs/week
[Plot, Sep – Mar: peak during the CMS "Data Challenge"]
# Jobs Per VO
New Accounting System
(Gratia)
Massive 2007 Data Reprocessing
by D0 Experiment @ Fermilab
~400M events total, ~250M on OSG
[Plot: reprocessing throughput, with contributions labeled OSG, LCG, SAM]
CDF Discovery of Bs Oscillations
B_s mesons oscillate between particle and antiparticle (B_s ↔ B̄_s) with frequency Δm_s/2π ≈ 2.8 THz. The decay-time distributions follow the standard mixing form

  P(t) ∝ (e^{-t/τ_{B_s}} / 2) [1 ± cos(x_s t / τ_{B_s})],   x_s = Δm_s τ_{B_s}

(+ for unmixed, − for mixed decays).
Communications:
International Science Grid This Week
Published since April 2005 (formerly SGTW, now iSGTW)
Diverse audience
>1000 subscribers
www.isgtw.org
OSG News: Monthly Newsletter
18 issues by Apr. 2007
www.opensciencegrid.org/osgnews
Grid Summer Schools
Summer 2004, 2005, 2006:
- 1 week @ South Padre Island, Texas
- Lectures plus hands-on exercises for ~40 students
- Students of differing backgrounds (physics + CS), minorities
Reaching a wider audience:
- Lectures, exercises, video, on web
- More tutorials, 3-4/year
- Students, postdocs, scientists
- Agency specific tutorials
Project Challenges
Technical constraints:
- Commercial tools fall far short, require (too much) invention
- Integration of advanced CI, e.g. networks
Financial constraints:
- Fragmented & short term funding injections (recent $30M/5 years)
- Fragmentation of individual efforts
Distributed coordination and management:
- Tighter organization within member projects compared to OSG
- Coordination of schedules & milestones
- Many phone/video meetings, travel
- Knowledge dispersed, few people have broad overview
Funding & Milestones: 1999 – 2007
Funding: PPDG, $9.5M (DOE); GriPhyN, $12M; iVDGL, $14M; CHEPREO, $4M; UltraLight, $2M; DISUN, $10M; OSG, $30M (NSF + DOE)
Milestones: first US-LHC grid testbeds; VDT 1.0 (2002); Grid3 start (2003); VDT 1.3; OSG start (2005); LHC start (2007); LIGO Grid; Grid Summer Schools 2004, 2005, 2006; Digital Divide Workshops 04, 05, 06
Threads: grid & networking projects; large experiments; education, outreach, training; grid communications
Challenges from Diversity and Growth
Management of an increasingly diverse enterprise:
- Sci/Eng projects, organizations, disciplines as distinct cultures
- Accommodating new member communities (expectations?)
Interoperation with other grids:
- TeraGrid
- International partners (EGEE, NorduGrid, etc.)
- Multiple campus and regional grids
Education, outreach and training:
- Training for researchers, students
- … but also project PIs, program officers
Operating a rapidly growing cyberinfrastructure:
- 25K → 100K CPUs, 4 → 10 PB disk
- Management of and access to rapidly increasing data stores
- Monitoring, accounting, achieving high utilization
- Scalability of support model
Rapid Cyberinfrastructure Growth: LHC
Meeting LHC service challenges & milestones
Participating in worldwide simulation productions
[Chart: projected LHC computing capacity (MSI2000) by year, 2007 – 2010, stacked by experiment (ALICE, ATLAS, CMS, LHCb) and tier (CERN, Tier-1, Tier-2), rising to ~350 MSI2000; the 2008 level is equivalent to ~140,000 PCs]
OSG Operations
Distributed model (scalability!): VOs, sites, providers
- Rigorous problem tracking & routing
- Security
- Provisioning
- Monitoring
- Reporting
Partners with EGEE operations
Five Year Project Timeline & Milestones
LHC: contribute to the Worldwide LHC Computing Grid; LHC simulations, then event data distribution and analysis; support 1000 users; 20 PB data archive
LIGO: contribute to LIGO workflow and data analysis; LIGO data run S5, then Advanced LIGO; LIGO Data Grid dependent on OSG
STAR, CDF, D0, astrophysics: CDF simulation, then simulation and analysis; D0 simulations, then reprocessing; STAR data distribution and jobs
Additional science communities: 10K jobs per day; +1 community each year, 2006 – 2011
Facility security: risk assessment, audits, incident response, management, operations, technical controls; plan v1 and 1st audit, then annual risk assessments and audits
Facility operations and metrics: increase robustness and scale; operational metrics defined and validated each year; interoperate and federate with campus and regional grids
VDT and OSG software releases: major release every 6 months, minor updates as needed; VDT 1.4.0, 1.4.1, 1.4.2, … with incremental updates; OSG 0.6.0, 0.8.0, 1.0, 2.0, 3.0
Interoperation: common s/w distribution with TeraGrid, then transparent data and job movement with TeraGrid; EGEE using VDT 1.4.x, then transparent data management with EGEE
Extended capabilities & increased scalability and performance for jobs and data to meet stakeholder needs: SRM/dCache extensions; "just in time" workload management; VO services infrastructure; integrated network management; dCache with role-based authorization; accounting and auditing; federated monitoring and information services; VDS with SRM; improved workflow and resource selection; data analysis (batch and interactive) workflow; work with SciDAC-2 CEDS and security with Open Science
Timeline: project start 2006; end of Phase I 2008; end of Phase II 2010; through 2011
Extra Slides
VDT Release Process (Subway Map)
Time runs from Day 0 (gather requirements) to Day N (OSG release):
- Gather requirements → build software → test
- Validation test bed → VDT Release
- ITB Release Candidate → integration test bed → OSG Release
(From Alain Roy)
VDT Challenges
How should we smoothly update a production service?
- In-place vs. on-the-side
- Preserve old configuration while making big changes
- Still takes hours to fully install and set up from scratch
How do we support more platforms?
- A struggle to keep up with the onslaught of Linux distributions (Fedora Core 3, 4, 6; RHEL 3, 4; BCCD; …)
- AIX? Mac OS X? Solaris?
How can we accommodate native packaging formats?
- RPM
- Deb
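The "on-the-side" option mentioned above is commonly realized by installing the new version next to the old one and atomically repointing a symlink, so the old install (and its configuration) survives for rollback. A minimal sketch, with illustrative paths that do not reflect the VDT's actual layout:

```python
# Sketch of an on-the-side upgrade: versioned install dirs plus an
# atomically swapped "current" symlink. Paths are illustrative only.
import os
import tempfile

def install_on_the_side(root: str, version: str) -> str:
    """Create a versioned install directory, e.g. <root>/vdt-1.6.1."""
    path = os.path.join(root, f"vdt-{version}")
    os.makedirs(path, exist_ok=True)
    return path

def activate(root: str, version: str) -> None:
    """Atomically switch <root>/current to point at the given version."""
    tmp = os.path.join(root, ".current-tmp")
    os.symlink(f"vdt-{version}", tmp)               # build new link aside
    os.replace(tmp, os.path.join(root, "current"))  # atomic rename/swap

root = tempfile.mkdtemp()
install_on_the_side(root, "1.6.0")
activate(root, "1.6.0")
install_on_the_side(root, "1.6.1")  # old version stays intact on disk
activate(root, "1.6.1")             # rollback is just activate(root, "1.6.0")
print(os.readlink(os.path.join(root, "current")))  # vdt-1.6.1
```

Because the swap is a single rename, jobs never see a half-updated install; an in-place update cannot make that guarantee.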