Review of NCAR
Al Kellie, SCD Director
November 01, 2001

Outline of Presentation
• Introduction to
  • UCAR
  • NCAR
  • SCD
• Overview of divisional activities
  • Research data sets (Worley)
  • Mass Storage System (Harano)
  • Extracting model performance (Hammond)
  • Visualization & Earth System Grid (Middleton)
• Computing RFP (ARCS)
Outline of Presentation
• INTRODUCTION
• Overview of three divisional aspects
• Computing RFP (ARCS)
University Corporation for Atmospheric Research

[Organization chart, dated 12/07/98; a marker in the original denotes President's Office units]

Member Institutions → Board of Trustees → President: Richard Anthes
  – Finance & Administration: Katy Schmoll, VP
  – Corporate Affairs: Jack Fellows, VP

NCAR – Tim Killeen, Director
  – Atmospheric Chemistry Division (ACD): Daniel McKenna
  – Atmospheric Technology Division (ATD): David Carlson
  – Climate & Global Dynamics Division (CGD): Maurice Blackmon
  – Environmental & Societal Impacts Group (ESIG): Robert Harriss
  – Advanced Study Program (ASP): Al Cooper
  – High Altitude Observatory (HAO): Michael Knölker
  – Mesoscale & Microscale Meteorological Division (MMM): Robert Gall
  – Research Applications Program (RAP): Brant Foote
  – Scientific Computing Division (SCD): Al Kellie

UCAR Programs – Jack Fellows, Director
  – Information Infrastructure Technology & Applications (IITA): Richard Chinman
  – Cooperative Program for Operational Meteorology, Education and Training (COMET): Timothy Spangler
  – Constellation Observing System for Meteorology, Ionosphere & Climate (COSMIC): Bill Kuo
  – GPS Science and Technology Program (GST): Randolph Ware
  – Digital Library for Earth System Science (DLESE): Mary Marlino
  – Unidata: David Fulker
  – Visiting Scientists Programs (VSP): Meg Austin
  – Joint Office for Science Support (JOSS): Karyn Sawyer
NCAR Organization

UCAR Board of Trustees → UCAR: Rick Anthes
NCAR: Tim Killeen; Associate Director: Steve Dickson
  – ISS: K. Kelly
  – B&P: R. Brasher
  – Atmospheric Chemistry: Dan McKenna
  – Atmospheric Technology: Dave Carlson
  – Climate & Global Dynamics: Maurice Blackmon
  – High Altitude Observatory: Michael Knolker
  – Mesoscale & Microscale Meteorology: Bob Gall
  – ESIG: Bob Harriss
  – ASP: Al Cooper
  – Research Applications: Brant Foote
  – Scientific Computing: Al Kellie
NCAR at a Glance
• 41 years; 850 staff – 135 scientists
• $128M budget for FY2001
• 9 divisions and programs
• Research tools, facilities, and visitor programs for the NSF and university communities
FY 2000 Funding Distribution (NCAR FY 2000 expenditures/commitments):
  NSF Regular 62%; NSF Special 5%; NASA 8%; Other 8%; FAA 7%; DOD 5%; NOAA 3%; DOE 2%; EPA 0.4%
Total FY2001 funding: $128M
NCAR Peer-Reviewed Publications
[Bar chart: publications per year, 1997–2000, split between NCAR authors and papers joint with outside authors.]
NCAR Visitors
[Bar chart: visitors per year, 1998–2000, broken down by length of stay (1–7, 8–30, 31–90, 91–180, and 180+ days).]
Where did SCD come from?
1959 “Blue Book”:
“There are four compelling reasons for establishing a National Institute for Atmospheric Research”
2. The requirement for facilities and technological assistance beyond those that can properly be made available at individual universities
SCD Mission
Enable the best atmospheric and related research, no matter where the investigator is located, through the provision of high-performance computing technologies and related services.
SCIENTIFIC COMPUTING DIVISION

Director’s Office – Al Kellie, Director (12)

• Computational Science – Steve Hammond (8)
  Algorithmic software development; model performance research; science collaboration frameworks; standards & benchmarking
• Network Engineering & Telecommunications – Marla Meehl (25)
  LAN; MAN; WAN; dial-up access; network infrastructure
• Data Support – Roy Jenne (9)
  Data archives; data catalogs; user assistance
• Operations and Infrastructure Support – Aaron Andersen (18)
  Operations room; facility management & reporting; database applications; site licenses
• High Performance Systems – Gene Harano (13)
  Supercomputer systems; mass storage systems
• User Support Section – Ginger Caldwell (21)
  Training/outreach/consulting; digital information; distributed servers & workstations; allocations & account management
• Visualization & Enabling Technologies – Don Middleton (12)
  Data access; data analysis; visualization

Funding: Base $24,874; UCAR $4,027; Outside $2,020; Overhead $1,063
SCD Base Budget Distribution
  Salaries: $4,800,000
  Benefits: $2,200,000
  G&A overhead: $4,000,000
  Other operating costs: $2,100,000
  Equipment, maintenance & software: $11,000,000
Computing Services for Research
• SCD operates two distinct computational facilities:
  – Climate simulations
  – The university community
• Governance of these SCD resources is in the hands of the users, through two external allocation committees.
• Computing leverages a common infrastructure for access, networking, data storage & analysis, research data sets, and support services, including software development and consulting.
Climate Simulation Laboratory (CSL)
• The CSL is a national, multi-agency, special-use computing facility for climate system modeling in support of the U.S. Global Change Research Program (USGCRP).
  – It serves priority projects that require very large amounts of computer time.
• CSL resources are available to individual U.S. researchers, with a preference for research teams, regardless of sponsorship.
• An inter-agency panel selects the projects that use the CSL.
Community Facility
• The Community Facility is used primarily by university-based NSF grantees and NCAR scientists.
  – Community resources are allocated evenly between NCAR and the university community.
• NCAR resources are allocated by the NCAR Director to the various NCAR divisions.
• University resources are allocated by the SCD Advisory Panel and are open to areas of atmospheric and related sciences.
Distribution of Compute Resources
All use of SCD computing by area of scientific interest, through September 2000 for Fiscal Year 2000:
  Climate 25.36%; Astrophysics 21.10%; Weather Prediction 17.81%; Miscellaneous 11.19%; Upper Atmosphere 10.58%; Oceanography 9.24%; Other 4.72%
“Other” includes Basic Fluid Dynamics (1.50%) and Cloud Physics (3.22%).
SCD Computing by Area of Scientific Interest
[Three charts: percentage of SCD computing use by fiscal year, FY90–FY00, for Climate, Weather Prediction, and Oceanography.]
History of Supercomputing at NCAR
[Timeline figure, 1960–2001, distinguishing production machines, non-production machines, and systems currently in production. Machines shown include: CDC 3600; CDC 6600; CDC 7600; Cray 1-A S/N 3; Cray 1-A S/N 14; Cray X-MP/4; TMC CM2/8192; Cray Y-MP/8; Cray Y-MP/2; IBM RS/6000 Cluster; IBM SP1/8; TMC CM5/32; CCC Cray 3/4; Cray Y-MP/8I; Cray C90/16; Cray T3D/64; Cray T3D/128; Cray J90/16; Cray J90/20; Cray J90se/24 (x2); HP SPP-2000/64; SGI Origin2000/128; Beowulf/16; Compaq ES40/36 Cluster; IBM SP/32; IBM SP/64; IBM SP/296; IBM SP/604; IBM SP/1308. The figure also notes STK 9940 (2001).]
NCAR Wide Area Connectivity
• OC3 (155 Mbps) to the Front Range GigaPop; OC12 (622 Mbps) on 1/1/2002
  – OC3 to AT&T commodity Internet
  – OC3 to C&W commodity Internet
  – OC3 to Abilene (OC12 on 1/1/2002)
• OC3 to the vBNS+
• OC12 (622 Mbps) to the University of Colorado at Boulder
  – Intra-site research and back-up link to FRGP
• OC12 to NOAA/NIST in Boulder
  – Intra-site research and UUNET commodity Internet
• Dark-fiber metropolitan area network at GigE (1000 Mbps) to other NCAR campus sites
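To give a rough sense of what these line rates mean for moving model output, here is a small, illustrative transfer-time sketch; the 100 GB dataset size and the assumption of full line-rate utilization are assumptions for illustration, not measurements from the deck.

```python
# Rough bulk-transfer-time estimate at the quoted nominal line rates.
# Assumes the full line rate is achievable end to end (real transfers rarely
# reach this); the 100 GB dataset size is an illustrative assumption.

LINKS_MBPS = {
    "OC3 (155 Mbps)":   155,
    "OC12 (622 Mbps)":  622,
    "GigE (1000 Mbps)": 1000,
}

def transfer_hours(dataset_gb: float, link_mbps: float) -> float:
    """Hours to move dataset_gb gigabytes at link_mbps megabits per second."""
    bits = dataset_gb * 8.0e9
    return bits / (link_mbps * 1.0e6) / 3600.0

for name, mbps in LINKS_MBPS.items():
    print(f"100 GB over {name}: ~{transfer_hours(100, mbps):.2f} h")
# Roughly 1.4 h on OC3, 0.36 h on OC12, 0.22 h on GigE at nominal rates.
```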
TeraGrid Wide Area Network
[Network map:
• StarLight international optical peering point (see www.startap.net) at Starlight/NW Univ, Chicago
• Backbone sites: Chicago, Indianapolis (Abilene NOC), Urbana, DENVER, Los Angeles, San Diego
• Link types: OC-48 (2.5 Gb/s, Abilene); multiple 10 GbE (Qwest); multiple 10 GbE (I-WIRE dark fiber)
• I-WIRE sites: UIC, ANL, Starlight/NW Univ, multiple carrier hubs, Ill Inst of Tech, Univ of Chicago, NCSA/UIUC
• Solid lines in place and/or available by October 2001; dashed I-WIRE lines planned for summer 2002]
ARCS Synopsis
Credit: Tom Engel
ARCS RFP Overview
• BEST VALUE PROCUREMENT
  – Technical evaluation
  – Delivery schedule
  – Production disruption
  – Allocation-ready state
  – Infrastructure
  – Maintenance
  – Cost impact, i.e., existing equipment
  – Past performance of bidders
  – Business proposal review
  – Other considerations: invitation to partner
ARCS Procurement
• Production-level
  – Availability, robust batch capacity, operational sustainability and support
  – Integrated software engineering and development environment
• High-performance execution of existing applications
• Additionally, an environment conducive to development of next-generation models
Workload profile context
• Jobs using > 32 nodes
  – 0.4% of workload
  – Average 44 nodes, or 176 PEs
• Jobs using < 32 nodes
  – 99.6% of workload
  – Average 6 nodes, or 24 PEs
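A quick back-of-the-envelope reading of this profile is sketched below; it assumes the percentages are by job count and that job durations are comparable across the two classes, neither of which the slide states.

```python
# Back-of-the-envelope view of the stated workload mix.
# Assumes percentages are by job count and job durations are comparable
# across the two classes (assumptions; the slide does not specify either).

large_frac, large_pes = 0.004, 176   # jobs using > 32 nodes
small_frac, small_pes = 0.996, 24    # jobs using < 32 nodes

avg_pes = large_frac * large_pes + small_frac * small_pes
large_share = large_frac * large_pes / avg_pes

print(f"average PEs per job:          ~{avg_pes:.1f}")      # ~24.6
print(f"PE demand from >32-node jobs: ~{large_share:.1%}")  # ~2.9%
```

Under those assumptions the large jobs account for only a few percent of aggregate processor demand, which fits the RFP's emphasis on both capacity and capability computing.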
ARCS – The Goal
• A production-level, high-performance computing system providing for both capability and capacity computing
• A stable and upwardly compatible system architecture, user environment, and software engineering & development environments
• Initial equipment: at least double current capacity at NCAR
• Long term: achieve 1 TFLOPs sustained by 2005
[Chart: sustained TFLOPs ramp, 2001–2005, from the initial installation toward the 1.0 sustained-TFLOPs goal.]
ARCS – The Process
• SCD began technical requirements draft Feb 2000
• RFP process (including scientific reps from NCAR
divisions, UCAR Contracts, & external review panel)
formally began Mar 2000; RFP released Nov 2000
• Offeror proposal reviews, BAFOs, & Supplemental
proposals Jan-May 2001
• Technical Evaluations, Performance projections, Risk
Assessment, etc. Feb-Jun 2001
• SCD Recommendation for Negotiations 21 Jun; NCAR/
UCAR acceptance of recommendation 25 Jun
• Negotiations 24-26 Jul; tech. Ts&Cs completed 14 Aug
• Contract submitted to the NSF 01 Oct
• NSF approval 5 Oct … joint press release during the week of SC01
ARCS RFP Technical Attributes
• Hardware (processors, nodes, memory, disk,
interconnect, network, HIPPI)
• Software (OS, user environment, filesystems,
batch subsystem)
• System admin., resource mgmt., user limits,
accounting, network/HIPPI, security
• Documentation & training
• System maintenance & support services
• Facilities (power, cooling, space)
Major Requirements
• Critical resource ratios:
  – Disk: 6 Bytes/peak-FLOP; 64+ MB/sec single-stream & 2+ GB/sec bandwidth, sustainable
  – Memory: 0.4 Bytes/peak-FLOP
• “Full-featured” product set (cluster-aware compilers, debuggers, performance tools, administrative tools, monitoring)
• Hardware & software stability
• Hardware & software vendor support & responsiveness (on-site, call center, development organization, escalation procedures)
• Resource allocation (processor(s), node(s), memory, disk; user limits & disk quotas)
• Batch subsystem and NCAR job scheduler (BPS)
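As a rough illustration of how the two ratios translate into system sizing, the sketch below applies them to an assumed peak rating; the 6 and 0.4 Bytes/peak-FLOP figures come from the slide, while the example peak value is only an assumption for illustration.

```python
# Sizing sketch: apply the ARCS resource ratios (6 bytes of disk and
# 0.4 bytes of memory per peak FLOP/s) to an assumed peak rating.
# The example peak value is illustrative, not a contract figure.

DISK_BYTES_PER_PEAK_FLOP = 6.0    # from the requirement above
MEM_BYTES_PER_PEAK_FLOP = 0.4     # from the requirement above

def implied_capacity(peak_tflops: float) -> tuple[float, float]:
    """Return (disk_TB, memory_TB) implied by the ratios for a given peak TFLOPs."""
    peak_flops = peak_tflops * 1.0e12
    disk_tb = DISK_BYTES_PER_PEAK_FLOP * peak_flops / 1.0e12
    mem_tb = MEM_BYTES_PER_PEAK_FLOP * peak_flops / 1.0e12
    return disk_tb, mem_tb

disk_tb, mem_tb = implied_capacity(6.8)   # e.g. a ~6.8 peak-TFLOPs system
print(f"disk:   ~{disk_tb:.0f} TB")       # ~41 TB
print(f"memory: ~{mem_tb:.1f} TB")        # ~2.7 TB
```

For comparison, the bluesky capacities quoted later (33 TB disk, 2.8 TB memory at 6.81+ peak TFLOPs) sit close to the memory ratio and somewhat below the nominal disk ratio, suggesting the ratios were treated as targets rather than strict minima.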
ARCS – Benchmarks (1)
• Kernels (Hammond, Harkness, Loft)
  – Single processor (COPY, IA, XPOSE, SHAL, RADABS, ELEFUNT, STREAMC)
  – Multi-processor shared memory (PSTREAM)
  – Message-passing performance (XPAIR, BISECT, XGLOB, COMMS[1,2,3], STRIDED[1,2], SYNCH, ALLGATHER)
• Parallel shared-memory applications
  – CCM3.10.16 (T42 30-days & T170 1-day) – CGD, Rosinski
  – WRF prototype (b_wave 5-days) – MMM, Michalakes
more >
ARCS – Benchmarks (2)
• Parallel (MPI & hybrid) models
  – CCM3.10.16 (T42 30-day & T170 1-day) – CGD, Rosinski
  – MM5 3.3 (t3a 6-hr & “large” 1-hr) – MMM, Michalakes
  – POP 1.0 (medium & large) – CGD, Craig
  – MHD3D (medium & large) – HAO, Fox
  – MOZART2 (medium & large) – ACD, Walters
  – PCM 1.2 (T42) – CGD, Craig
  – WRF prototype (b_wave 5-day) – MMM, Michalakes
• System tests
  – HIPPI – SCD, Merrill
  – I/O-tester – SCD, Anderson
  – Network – SCD, Mitchell
  – Batch workload – SCD, Engel; includes 2 I/O-tester, 4 hybrid MM5 3.3 large, 2 hybrid MM5 3.3 t3a, 2 POP 1.0 medium & large, CCM3.10.16 T170, MOZART2 medium, PCM 1.2 T42, 2 MHD3D medium & large, WRF prototype
< return
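For a flavor of what the single-processor kernels listed above (such as COPY and STREAMC) measure, here is a minimal memory-bandwidth sketch in Python/NumPy; it only illustrates the kind of measurement, it is not the ARCS kernel code, and the array size is an arbitrary assumption.

```python
# Minimal STREAM-COPY-style bandwidth sketch (illustrative only, not the
# ARCS benchmark kernel). Measures effective memory bandwidth of a large
# array copy: one read and one write per element.
import time
import numpy as np

N = 50_000_000            # ~400 MB per array (assumed; large enough to exceed cache)
a = np.random.random(N)
b = np.empty_like(a)

best = float("inf")
for _ in range(5):                         # take the best of a few repetitions
    t0 = time.perf_counter()
    np.copyto(b, a)                        # b[:] = a
    best = min(best, time.perf_counter() - t0)

bytes_moved = 2 * a.nbytes                 # read a, write b
print(f"copy bandwidth: ~{bytes_moved / best / 1e9:.1f} GB/s")
```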
Risks
• Vendor ability to meet commitments
– Hardware (processor architecture, clock speed
boosts, memory architecture)
– Software (OS, filesystems, processor-aware
compilers/libraries, tools [3rd party])
• Service, Support, Responsiveness
• Vendor stability (product set, financial)
• Vendor promises vs. reality
Past Performance
• Hardware & Software
– SCD/NCAR experience
– Other customers’ experience
• “Missed Promises”
– Vendor X ~ 2 yr slip, product line changes
– Vendor Y ~ on target
– Vendor Z ~ 1.5 yr slip, product line changes
Other Considerations
• “Blue Light” project: invitation to develop models for an exploratory supercomputer
  – Invitation to a partnership development
  – Offer of an industrial partnership
    • 256 TFLOPs peak, 8 TB memory, 200 TB disk on 64k nodes; true MPP with torus interconnect
    • Node: 64 GFLOPs, 128 MB memory, 32 KB L1 cache, 4 MB L2 cache
  – Columbia, LLNL, SDSC, Oak Ridge
ARCS Award
• IBM was chosen to supply the NCAR Advanced Research Computing System (ARCS), which will exceed the articulated purpose and goals:
  – A world-class system to provide reliable production supercomputing to the NCAR community and the Climate Simulation Laboratory
  – A phased introduction of new, state-of-the-art computational, storage, and communications technologies through the life of the contract (3-5 years)
  – First equipment delivered Friday, 5 October
ARCS Timetable
3-year contract:
  Oct 2001     – blackforest upgrade: Winterhawk-2 & Nighthawk-2 nodes, 375 MHz POWER3-II
  Sep 2002     – bluesky with Colony Switch: Regatta nodes, ~1.35 GHz POWER4
  Sep-Dec 2003 – Federation Switch upgrade (blackforest removed after Federation acceptance)
2-year extension option:
  Sep-Dec 2004 – bluesky upgrade: Armada nodes, ~2.0 GHz POWER4-GP
ARCS Capacities (minimum total disk capacity, total memory, and new peak TFLOPs with running total)
3-year contract:
  Oct 2001     – blackforest upgrade: 10.5 TB disk, 0.75 TB memory, 1.1 peak TFLOPs new (2.0 total)
  Sep 2002     – bluesky with Colony Switch: 33 TB disk, 2.8 TB memory, 5.81+ peak TFLOPs new (6.81+ total)
  Sep-Dec 2003 – Federation Switch upgrade
2-year extension option:
  Sep-Dec 2004 – bluesky upgrade: 65 TB disk, 3.8 TB memory, 8.75+ peak TFLOPs new (8.75+ total)
+ Negotiated capability commitments may require installation of additional capacity.
ARCS Commitments
• Minimum model-capability commitments (the blackforest upgrade defines 1.0x):
  – blackforest upgrade: 1.0x
  – bluesky: 3.1x
  – bluesky upgrade: 4.6x
  Failure to meet these commitments will result in IBM installing additional computational capacity.
• Improved user-environment functionality, support, and problem-resolution response
• Early access to new hardware & software technologies
• NCAR’s participation in IBM’s “Blue Light” exploratory supercomputer project (PFLOPs)
ARCS, SCD Roadmap Goal, and Moore's Law
[Chart: estimated sustained TFLOPs, Jul 2001 – Jul 2005, tracing the ARCS contract milestones (blackforest upgrade, bluesky install, Federation install, blackforest deinstall, bluesky upgrade) and the likely path to the 1.0 sustained-TFLOPs goal, compared against a Moore's Law curve.]
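For context on the Moore's Law curve in that figure, the sketch below projects sustained performance under an assumed doubling time; both the starting sustained value and the 18-month doubling period are illustrative assumptions, not figures taken from the deck.

```python
# Illustrative Moore's-Law-style projection of sustained TFLOPs.
# The mid-2001 starting value and the 18-month doubling period are
# assumptions for illustration; the deck's curve may use different parameters.

START_YEAR = 2001.5               # assumed reference point (mid-2001)
START_SUSTAINED_TFLOPS = 0.1      # assumed starting sustained rate
DOUBLING_TIME_YEARS = 1.5         # classic 18-month doubling assumption

def projected_sustained(year: float) -> float:
    """Sustained TFLOPs if performance doubles every DOUBLING_TIME_YEARS."""
    return START_SUSTAINED_TFLOPS * 2.0 ** ((year - START_YEAR) / DOUBLING_TIME_YEARS)

for year in (2002, 2003, 2004, 2005):
    print(f"{year}: ~{projected_sustained(year):.2f} sustained TFLOPs")
# Under these assumptions the curve only reaches ~0.5 sustained TFLOPs by 2005,
# so hitting the 1-TFLOPs-sustained goal means outpacing this doubling rate.
```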
Proposed Equipment - IBM

                 ARO+60                       Sep 2002
  Nodes          164 WH2/4, 5 NH2/16          +120 POWER4 MI SMP/8
  Processor      375 MHz POWER3               1.35 GHz POWER4
  Interconnect   TBMX (180 MB/s; 22 usec)     Colony/NH2 adapter† (345 MB/s; 17 usec)
  Peak TF        1.1                          6.6
  Mem (TB)       0.45                         2.49
  Disk (TB)      23.5                         44.5

System software: PSSP/AIX, JFS/GPFS, LoadLeveler
† Federation switch (2400 MB/s, 4 usec) option 2H03
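One simple way to read the interconnect figures above is through a latency-plus-bandwidth transfer model; the sketch below uses only the latency and bandwidth numbers quoted on the slide, with illustrative message sizes.

```python
# Rough alpha-beta model for the quoted interconnects:
#   time = latency + message_size / bandwidth
# Latency/bandwidth values are the ones quoted on the slide; the message
# sizes are illustrative assumptions.

INTERCONNECTS = {
    "TBMX":       {"latency_us": 22.0, "bw_MBps": 180.0},
    "Colony/NH2": {"latency_us": 17.0, "bw_MBps": 345.0},
    "Federation": {"latency_us": 4.0,  "bw_MBps": 2400.0},
}

def transfer_time_us(name: str, msg_bytes: int) -> float:
    """Estimated time in microseconds to move one message over the named link."""
    link = INTERCONNECTS[name]
    return link["latency_us"] + msg_bytes / (link["bw_MBps"] * 1.0e6) * 1.0e6

for size in (1_024, 1_048_576):            # 1 KB and 1 MB messages (assumed)
    for name in INTERCONNECTS:
        print(f"{name:11s} {size:>9d} B: ~{transfer_time_us(name, size):9.1f} us")
# Small messages are latency-bound; large messages are bandwidth-bound,
# which is why the Federation switch option matters most for bandwidth-heavy codes.
```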
ARCS Roadmap

Oct ’01 – blackforest upgrade:
  blackforest: 2.0 TFLOPs peak; 0.73 TB memory; 10.5 TB GPFS disk; TBMX switch; 375 MHz POWER3-II; 315 WH2/4pe + 3 NH2/16pe; 512 MB memory/pe

Oct ’02 – bluesky installation:
  blackforest: 2.0 TFLOPs peak; 0.73 TB memory; 10.5 TB GPFS disk; TBMX switch; 375 MHz POWER3-II; 315 WH2/4pe (NH2 moved to bluesky); 512 MB memory/pe
  bluesky: 4.8+ TFLOPs peak; 2.8 TB memory; 21 TB GPFS disk; Colony switch; 3 NH2/16pe (POWER3); ~1.35 GHz POWER4 (node/pe counts TBD); ~2.0 GB memory/pe

Oct ’03 – Federation upgrade:
  blackforest: 2.0 TFLOPs peak; 0.73 TB memory; 10.5 TB GPFS disk; TBMX switch; 375 MHz POWER3-II; 315 WH2/4pe; 512 MB memory/pe
  bluesky: 4.8+ TFLOPs peak; 2.8 TB memory; 21 TB GPFS disk; Federation switch; NH2 removed; ~1.35 GHz POWER4 (node/pe counts TBD); ~2.0 GB memory/pe

Oct ’04 – bluesky upgrade:
  bluesky: 8.75+ TFLOPs peak; 3.8 TB memory; 65 TB GPFS disk; Federation switch; ~2.0 GHz POWER4-GP (node/pe counts TBD); ~3.0 GB memory/pe

“TFLOP Option”: SCD will likely augment bluesky with additional POWER4 nodes when blackforest is decommissioned.
Thank you all for attending CAS 2001.
See you all in 2003!