OSG Overview
for the Science Advisory Group
Ruth Pordes
Fermilab
June 12th 2007
Goals of The OSG
• Maintain the Distributed Facility
 Through a core of usable, extensible, dependable, secure distributed infrastructure delivering to the science needs of the stakeholders.
• Provide mechanisms and help for user groups
to adapt their codes and use the OSG.
• Provide for opportunistic use of shared
resources as well as resource-use through prior
agreement.
• Provide an integrated, secure, reference
software stack for OSG and other Grids.
• Grow to be a truly national resource that anyone can join and that is available to any researcher.
Scientific Advisory Group 11/6/2015
Benefits to Science and Research
• Enable scientists to use a greater % of the
available compute cycles.
• Help scientists to use distributed systems and
software with less effort.
• Enable more sharing and reuse of software, and reduce duplication of effort, by providing effort for integration and extensions.
• Establish an “open-source” community working together to share knowledge and experience and to lower overheads for new participants.
Cost-Value Model
• Increased usage of CPUs and infrastructure alone (i.e., the cost of processing cycles) is not the persuasive cost-benefit argument.
• The benefits come from reducing risk in, and sharing support for, large, complex systems which must be run for many years with a short-tenure workforce.
 Savings in effort for integration, system and software support.
 Opportunity and flexibility to distribute load and address peak needs.
 Maintenance of an experienced workforce in a common system.
 Lowering the cost of entry for new contributors.
 Enabling new computational opportunities for communities that would not otherwise have access to such resources.
OSG in a nutshell
History
[Timeline, 1999–2009: PPDG (DOE), GriPhyN (NSF), and iVDGL (NSF) lead through Trillium/Grid3 to OSG (DOE+NSF), alongside LIGO preparation and operation, LHC construction, preparation, and operations, the European Grid + Worldwide LHC Computing Grid, and campus and regional grids.]
Grid projects established working collaborations between Condor, Globus, and the physics experiments. OSG leadership led a “grass-roots” collaboration of these projects. The US LHC program committed to a joint project with broader contributions and goals; LIGO committed to a data grid model. DOE and NSF accepted a joint SciDAC and unsolicited NSF proposal.
The Consortium and the Project
• The Consortium comprises all institutions and
projects that contribute to OSG.
• The Project is funded to provide staff for
specific aspects of managing and sustaining
the OSG.
• The deliverables and milestones of the project serve the scientific needs of the consortium members.
• All OSG activities involve both Project staff
and contributors from the Consortium.
Structure of the Consortium
Scope of the OSG Project
• Included:
 The distributed facility operation and maintenance.
 Training and education of new participants and
contributors.
 Management and administration of the project and the
consortium.
 Extensions to and integration of new services, software,
capabilities and user communities.
• Not Included:
 Resources - farms and storage - are contributed. OSG currently has access to ~34K cores and 2 PB of disk storage.
 Software - facility and application - is developed by external projects with their own priorities and schedules.
Structure of the Project
Area                                    FTEs Planned
Facility management and operations       7.0
Security & troubleshooting               4.5
Software release & support               6.5
Engagement                               2.0
Training & education                     2.0
Extensions                               8.0
Executive Director and administration    3.0
Total                                   33.0
What do I do as Executive Director?
• Work with the area coordinators & institutional PIs:
 To define and execute the program of work.
 To ensure expectations and outcomes come together, are communicated, and are understood.
 To match needs, priorities, and effort. Many OSG staff are fractions of an FTE.
 To collaborate with the external software development projects on which we depend.
• Work with the Council & Consortium:
 As the interface to the project in many areas.
 On large-scale requests for use of the resources.
 On agreements with partners for bi-lateral commitments.
 On extending our membership and participation.
 Organize reviews, Joint Oversight Team presentations, and Consortium meetings.
• Communicate a lot, e.g.:
 Represent OSG on the WLCG Management Board.
 Interface to the funding agencies.
 Present OSG in various meetings.
Some of the Challenges?
• Making the consortium and project work with people from different organizational cultures and with different definitions of “success”.
• Bringing a focus on operations and stability rather than development and “innovation”.
• Balancing the directed needs of stakeholders with the broader scope of commitments.
Institutions Involved
Project Staff (FTEs): Boston 0.5, BNL 3.0, CalTech 2.0, Columbia 0.5, Cornell 0.5, FermiLab 7.0, ISI (year 1) 0.5, Indiana U 3.0, LBNL 1.5, RENCI 1.5, SLAC 0.5, UCSD 2.0, U of Chicago 3.0, U of Florida 0.5, U of Iowa 1.0, Wisconsin 6.0. Total: 33.0 FTEs.

Sites on OSG: many with >1 resource; 46 separate institutions (* = no physics):
Academia Sinica, Boston U., Brookhaven National Lab, Caltech, Cinvestav Mexico City, Clemson U.*, Dartmouth U.*, Fermilab, Florida International U., Florida State U., Hampton U., Indiana University, Iowa State, Kansas State, LBNL, Lehigh University*, Louisiana Tech*, Louisiana University, McGill U., MIT, Nebraska, Notre Dame, Oklahoma U., Penn State U., Purdue U., Rice U., SLAC, Southern Methodist U., TTU, U. California at Riverside, U. Illinois Chicago, U. New Mexico, U. of Arkansas*, U. of Chicago, U. of Florida, U. of Iowa, U. of Michigan, U. of Sao Paolo, U. Texas at Arlington, U. Virginia, U. Wisconsin Madison, U. Wisconsin Milwaukee, UCSD, UERJ Brazil, Vanderbilt U., Wayne State U.
Users and Communities/VOs
Campus Grids: 5
 Georgetown University Grid (GUGrid)
 Grid Laboratory of Wisconsin (GLOW)
 Grid Research and Education Group at Iowa (GROW)
 University of New York at Buffalo (GRASE)
 Fermi National Accelerator Center (Fermilab)
Research VOs: 15 (5 are non-physics)
 Collider Detector at Fermilab (CDF)
 Compact Muon Solenoid (CMS)
 CompBioGrid (CompBioGrid)
 D0 Experiment at Fermilab (DZero)
 Dark Energy Survey (DES)
 Functional Magnetic Resonance Imaging (fMRI)
 Geant4 Software Toolkit (geant4)
 Genome Analysis and Database Update (GADU)
 International Linear Collider (ILC)
 Laser Interferometer Gravitational-Wave Observatory (LIGO)
 nanoHUB Network for Computational Nanotechnology (NCN) (nanoHUB)
 Sloan Digital Sky Survey (SDSS)
 Solenoidal Tracker at RHIC (STAR)
 Structural Biology Grid (SBGrid)
 United States ATLAS Collaboration (USATLAS)
Regional Grids: 4
 NYSGRID
 Distributed Organization for Scientific and Academic Research (DOSAR)
 Great Plains Network (GPN)
 Northwest Indiana Computational Grid (NWICG)
OSG Operated VOs: 4
 Engagement (Engage)
 Open Science Grid (OSG)
 OSG Education Activity (OSGEDU)
 OSG Monitoring & Operations
CPUHours/Day on OSG During 2007
[Chart: CPU-hours/day across OSG sites, January 1 – May 28, 2007; y-axis 0–160,000. Sites shown: AGLT2, ASGC_OSG, BNL_OSG, BNL_PANDA, CIT_CMS_T2, FIU-PG, FNAL_CDFOSG_1, FNAL_CDFOSG_2, FNAL_FERMIGRID, FNAL_GPFARM, GLOW, GRASE-CCR-U2, GRASE-GENESEO-OSG, GROW-PROD, HEPGRID_UERJ, IPAS_OSG, Lehigh Coral, MIT_CMS, Nebraska, NERSC-PDSF, OSG_LIGO_PSU, OU_OCHEP_SWT2, OU_OSCER_ATLAS, OU_OSCER_CONDOR, Purdue-Lear, Purdue-RCAC, SPRACE, STAR-BNL, STAR-WSU, TTU-ANTAEUS, UC_ATLAS_MWT2, UCRHEP, UCSDT2, UFlorida-IHEPA, UFlorida-PG, USCMS-FNAL-WC1-CE, USCMS-FNAL-WC1-CE2, UTA-DPCC, UTA_SWT2, UWMilwaukee, Vanderbilt.]
We are currently undercounting by probably ~25%, as not all sites are reporting.
1 CPUYear ~9,000 CPUHours
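As a rough sanity check, the chart's daily CPU-hours can be converted into an equivalent number of fully-utilized CPUs. This back-of-the-envelope sketch is illustrative arithmetic, not OSG accounting code; the ~160,000 CPU-hours/day peak and the ~25% undercount are the figures from this slide.

```python
# Back-of-the-envelope conversion of daily CPU-hours into an equivalent
# count of fully-utilized CPUs. The peak value and the ~25% undercount
# come from this slide; everything else is simple arithmetic.

def equivalent_cpus(cpu_hours_per_day, undercount_fraction=0.0):
    """CPUs running flat out (24 h/day) that would produce this daily total."""
    corrected = cpu_hours_per_day / (1.0 - undercount_fraction)
    return corrected / 24.0

peak = equivalent_cpus(160_000)             # ~6,700 CPUs at the observed peak
corrected = equivalent_cpus(160_000, 0.25)  # ~8,900 CPUs if ~25% goes unreported
```

So the observed peak corresponds to roughly 6,700 CPUs busy around the clock, or closer to 9,000 once unreported sites are estimated in - consistent with the ~34K cores being shared with owners' own workloads.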
National Activities
• We interoperate and collaborate with TeraGrid:
 Several communities run applications across both.
 Several sites are on both.
 Common Condor and Globus versions and testing infrastructure.
 Shared training exercises.
• We promote the development of local
infrastructures and expertise.
 Campus Infrastructure Days, with Internet2, TeraGrid, and Educause, help campuses (CIOs, researchers, teaching departments) identify cross-campus needs and organize themselves to participate.
International Activities
• We deliver the US contribution to the Worldwide LHC Computing Grid (WLCG) in support of the LHC experiments.
 Importance of interoperability and compatibility with the other WLCG infrastructures.
• Several communities run jobs and transfer data across both Enabling Grids for E-sciencE (EGEE) and OSG.
• Several sites and partners are international.
CPUHours/Day by VO
[Chart: CPU-hours/day by VO, January 1 – May 28, 2007; y-axis 0–160,000. VOs shown: ATLAS, cdf, cdms, cms, des, dosar, dzero, engage, gadu, glow, gpn, grow, ilc, ktev, LIGO, miniboone, mipp, nanohub, osg, sdss, star, zeus.]
Engage is running Rosetta, from the Kuhlman Lab. OSG is running protein molecular dynamics (CHARMM) for Johns Hopkins.
We measure how we are doing
• Summaries of support requests and
resolutions.
• Accounting information
 of CPU, storage and data transfer by site and VO
 includes shared and opportunistic resource use
 Includes information from the user accounting systems.
 Includes some error-reporting information.
• Availability testing, monitoring, and display.
Feedback from the agencies is that we need more of this.
How do we know if we are doing well?
• Feedback from users and sites is important and ongoing - mail lists, weekly operations meetings, Council meetings.
• Gathering information for research briefs and monthly news articles gives us a sense of whether the use of OSG is benefiting scientific and research output.
• Project deliverables and milestones give a
measure of how well the project is executing
its plans.
Project Planning
• Overall five-year goals and milestones come from the proposal.
• A yearly plan of work is made with the Area
Coordinators which results in:
 deliverables, activities & schedule (captured in a WBS),
 high-level milestones - some agency-reportable,
 effort assignments.
• We have signed Statements of Work with each
institutional PI with project funds.
 There is a working change control process.
• We revise our plans via weekly Executive Team meetings and Executive Board meetings every six weeks.
Project Tracking
• Milestones are tracked by the Project
Associate and discussed in weekly Executive
Team meetings.
• Area coordinators and Institutional PIs submit
quarterly reports.
• Accounted expenditures are tracked
quarterly.
• Staff submit monthly reports.
• Weekly area and activity meetings are used for day-to-day tracking and discussion of progress.
Resource Needs and Resource Availability
• Many resources are owned or statically
allocated to one user community.
 The institutions which own resources typically have
ongoing relationships with (a few) particular user
communities (VOs)
• The remainder of an organization’s available resources can be “used by everyone or anyone else”.
 Organizations can decide against supporting particular VOs.
 OSG staff are responsible for monitoring and, if needed, managing this usage.
• Our challenge is to maximize good - successful - output from the whole system.
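The access model above can be sketched as a toy matchmaking rule. This is illustrative Python, not OSG software, and the site and VO names are invented: each site's owner VO has first claim, a site may deny particular VOs, and everything left over is opportunistically usable.

```python
# Toy sketch of the OSG access model described above (not real OSG software).
# Each site has an owner VO with first claim on its resources and may deny
# specific VOs; the remainder is available opportunistically.
sites = {
    "site_a": {"owner": "cms",   "deny": set()},
    "site_b": {"owner": "atlas", "deny": {"dzero"}},
    "site_c": {"owner": "ligo",  "deny": set()},
}

def opportunistic_sites(vo):
    """Sites a VO may use opportunistically: not its own, and not denying it."""
    return sorted(name for name, s in sites.items()
                  if s["owner"] != vo and vo not in s["deny"])

dzero_sites = opportunistic_sites("dzero")  # site_b denies dzero
```

The real policy layer is of course richer (priorities, quotas, preemption), but the maximization challenge in the last bullet is exactly about scheduling work well across the "leftover" capacity this rule exposes.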
An Example: D0 reprocessing
• D0’s own resources are committed to the
processing of newly acquired data and analysis
of the processed datasets.
• In Nov ‘06 D0 asked to use 1500-2000 CPUs for
2-4 months for re-processing of an existing
dataset (~500 million events) for science results
for the summer conferences in July ‘07.
• The Executive Board estimated there were
currently sufficient opportunistically available
resources on OSG to meet the request; We also
looked into the local storage and I/O needs.
• The Council members agreed to contribute
resources to meet this request.
How did D0 Reprocessing Go?
• D0 had 2-3 months of smooth production
running using >1,000 CPUs and met their
goal by the end of May.
• To achieve this:
 D0 testing of the integrated software system took until February.
 OSG staff and D0 then worked closely together as a team to reach the needed throughput goals, facing and solving problems with:
  sites - hardware, connectivity, software configurations;
  application software - performance, error recovery;
  scheduling of jobs to a changing mix of available resources.
The Results
• Reprocessing was completed, albeit late. 445 million events were reprocessed.
 12 sites contributed significant resources.
 Over 1,000 jobs a day were sustained.
 286 million events were done on OSG sites.
• The initial ramp-up to scale was slow and labor-intensive for both D0 and OSG.
• Changes in the availability of resources had a negative impact.
• Sustaining the throughput was manpower-intensive on the D0 side.
• Problems encountered:
 Each site had unique problems when initially used. Sites were less stable than expected.
 Root-cause diagnosis and analysis of problems was very difficult.
 Scaling up exposed problems in throughput and overheads.
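A quick arithmetic check on the numbers above. The event counts come from this slide; the ~90-day window is an assumption, taken as a midpoint of the "2-3 months" of smooth running quoted earlier.

```python
# Quick checks on the D0 reprocessing numbers quoted above. Event counts are
# from the slide; the 90-day production window is an assumed midpoint of the
# "2-3 months" of smooth running mentioned earlier.
total_events = 445_000_000
osg_events = 286_000_000

osg_share = osg_events / total_events   # ~0.64: about 64% ran on OSG sites
days, jobs_per_day = 90, 1_000          # assumed window and sustained job rate
events_per_job = total_events / (days * jobs_per_day)  # roughly 5,000 events/job
```

So under these assumptions OSG sites carried roughly two thirds of the reprocessing, with each job handling on the order of a few thousand events.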
D0 Throughput
[Chart: D0 event throughput and D0 CPU-hours/week on OSG, weeks 1–23 of 2007; y-axis 0–160,000. Sites shown: CIT_CMS_T2, FNAL_DZEROOSG_2, FNAL_FERMIGRID, FNAL_GPFARM, GLOW, GRASE-CCR-U2, MIT_CMS, MWT2_IU, Nebraska, NERSC-PDSF, OSG_LIGO_PSU, OU_OSCER_ATLAS, OU_OSCER_CONDOR, Purdue-RCAC, SPRACE, UCSDT2, UFlorida-IHEPA, UFlorida-PG, USCMS-FNAL-WC1-CE.]
What did this teach us?
• Consortium members contributed significant opportunistic resources, as promised.
• VOs can use a significant number of sites
they “don’t own” to achieve a large effective
throughput.
• Combined teams make large production runs
effective.
 How does this scale?
• Overall availability was sufficient for the request to be met.
 How are we going to support multiple requests that oversubscribe the resources? We anticipate this may happen soon.
Training
• Grid Schools train students, teachers and new
entrants to use grids:
 2-3 day training with hands-on workshops and a core curriculum (based on the iVDGL annual week-long schools).
 3 held already; several more this year (2 scheduled), some as participants in international schools.
 20-60 in each class. Each class is regionally based with a broad catchment area.
 Gathering an online repository of training material.
• End-to-end application training in collaboration
with user communities.
Education
• We participate in cyberinfrastructure educational projects:
 I2U2, an extension of the QuarkNet project.
 A site at a South African university.
• Student Projects:
 Now that the new Education Coordinator is starting, we will follow up with students and their organizations to help them use OSG for projects and research.
Some of the Challenges I worry about
• How do we ensure, measure and show
scientific benefit both to our existing
stakeholders and new communities?
• What activities do we need towards a
sustainable economic model for
operation and support?