Working Group updates, Suite Tests and Scalability, Race conditions,
SSS-OSCAR Releases and Hackerfest
Al Geist
August 17-19, 2005
Oak Ridge, TN
Welcome to Oak Ridge National Lab!
First quarterly meeting here
Demonstration: Faster than Light Computer
Able to calculate the answer before the problem is specified
Scalable Systems Software
Participating Organizations
ORNL
ANL
LBNL
PNNL
SNL
LANL
Ames
IBM
Cray
Intel
SGI
NCSA
PSC
Problem
• Computer centers use incompatible, ad hoc sets of systems tools
• Present tools are not designed to scale to multi-Teraflop systems
Goals
• Collectively (with industry) define standard interfaces between systems
components for interoperability
• Create scalable, standardized management tools for efficiently
running our large computing centers
[Diagram: Resource Management, Accounting & user mgmt, System Monitoring,
Job management, System Build & Configure]
To learn more visit
www.scidac.org/ScalableSystems
Scalable Systems Software Suite
Any Updates to this diagram?
Components written in any mixture of C, C++, Java, Perl, and Python
can be integrated into the Scalable Systems Software Suite
[Diagram: Grid Interfaces – Meta Services (Meta Scheduler, Meta Monitor,
Meta Manager); Accounting; Scheduler; System & Job Monitor; Node State
Manager; Service Directory; Allocation Management; Usage Reports; Event
Manager; Process Manager; Job Queue Manager; Validation & Testing; Node
Configuration & Build Manager; Checkpoint / Restart; Hardware
Infrastructure Manager – components connected through standard XML
interfaces with common authentication and communication]
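Because every component talks over the suite's standard XML interfaces on
authenticated connections, a component written in any of the listed languages
can join by exchanging XML messages. The sketch below, in Python, is only
illustrative: the message schema, host, and port are hypothetical stand-ins,
not the actual SSS wire protocol or the ssslib API.

```python
# Illustrative only: a toy component announcing itself to a directory
# service with an XML message over a socket. The schema, host, and port
# are hypothetical stand-ins, not the actual SSS wire protocol or ssslib.
import socket
import xml.etree.ElementTree as ET

def build_register_message(name, host, port):
    """Build a small XML registration message (made-up schema)."""
    root = ET.Element("register")
    ET.SubElement(root, "component", name=name, host=host, port=str(port))
    return ET.tostring(root)

def announce(directory_host="localhost", directory_port=5555):
    """Send the registration to a (hypothetical) service directory."""
    msg = build_register_message("example-monitor", "node001", 6000)
    with socket.create_connection((directory_host, directory_port)) as conn:
        conn.sendall(msg)
        reply = conn.recv(4096)      # e.g. an XML acknowledgement
    return ET.fromstring(reply)
```

Any of the suite languages could put the same bytes on the wire, which is what
makes the mixed-language integration practical.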
SSS-OSCAR Components in Suites
Multiple component implementations exist
[Diagram of suite implementations: Grid scheduler; Meta Manager; Meta
Services; NSM (Node State Manager); Maui scheduler; Gold (allocation
management and usage reports); Warehouse (superMon, NWPerf); SD (Service
Directory); ssslib; BCM; EM (Event Manager); APITest; PM; Bamboo (QM);
BLCR; HIM]
Compliant with PBS and LoadLeveler job scripts
Scalable Systems Users
Production use today:
• Running an SSS suite at ANL and Ames
• ORNL industrial cluster (soon)
• Running components at PNNL
• Maui w/ SSS API (3000/mo), Moab (Amazon, Ford, TeraGrid, …)
Who can we involve before the end of the project?
- National Leadership-class facility?
NLCF is a partnership between
ORNL (Cray), ANL (BG), PNNL (cluster)
- NERSC and NSF centers
NCSA cluster(s)
NERSC cluster?
NCAR BG
Goals for This Meeting
Updates on the Integrated Software Suite components
Change in Resource Management Group
Scott Jackson left PNNL
Planning for SciDAC phase 2 –
discuss new directions
Preparing for next SSS-OSCAR software suite release
What needs to be done at hackerfest?
Getting more outside users.
Production and feedback to suite
Since Last Meeting
• FastOS Meeting in DC
• Any chatting about leveraging our System Software?
• SciDAC 2 Meeting in San Francisco
• Scalable Systems Poster
• Talk on ISICs
• Several SSS members there. Anything to report?
• Telecons and new entries in Electronic Notebooks
• Pretty sparse since last meeting
Agenda – August 17
8:00 Continental Breakfast CSB room B226
8:30 Al Geist - Project Status
9:00 Craig Steffen – Race Conditions in Suite
9:30 Paul Hargrove – Process Management and Monitoring
10:30 Break
11:00 Todd Kordenbrock – Robustness and Scalability Testing
12:00 Lunch (on own at cafeteria)
1:30 Brett Bode - Resource Management components
2:30 Narayan Desai - Node Build, Configure, Cobalt status
3:30 Break
4:00 Craig Steffen – SSSRMAP in ssslib
4:30 Discuss proposal ideas for SciDAC 2
4:30 Discussion of getting SSS users and feedback
5:30 Adjourn for dinner
Agenda – August 18
8:00 Continental Breakfast
8:30 Thomas Naughton - SSS OSCAR software releases
9:30 Discussion and voting
• Your name here
10:30 Group discussion of ideas for SciDAC-2.
11:30 Discussion of Hackerfest goals
Set next meeting date/location:
12:00 Lunch (walk over to cafeteria)
1:30 Hackerfest begins room B226
3:00 Break
5:30 or whenever break for dinner
Agenda – August 19
8:00 Continental Breakfast
8:30 Hackerfest continues
12:00 Hackerfest ends
What is going on in SciDAC 2
Executive Panel
Five Workshops in past 5 wks
Preparing a SciDAC 2 program plan at LBNL today!
ISIC section has words about system software and tools
View to the Future
HW, CS, and Science Teams all contribute to the science breakthroughs
[Diagram labels: Ultrascale Hardware (Rainer, Blue Gene, Red Storm OS/HW
teams); Leadership-class Platforms; Software & Libs (SciDAC CS teams);
Computing Environment – common look & feel across diverse HW; SciDAC
Science Teams – tuned codes; High-End Research – science problem teams;
Breakthrough Science]
SciDAC Phase 2 and CS ISICs
Future CS ISICs need to be mindful of needs of
National Leadership Computing facility
w/ Cray, IBM BG, SGI, clusters, multiple OS
No one architecture is best for all applications
SciDAC Science Teams
Needs depend on application areas chosen
End stations? Do they have special SW needs?
FastOS Research Projects
Complement, don’t duplicate these efforts
Cray software roadmap
Making the Leadership computers usable, efficient, fast
Gaps and potential next steps
Heterogeneous leadership-class machines
science teams need to have a robust environment that presents similar
programming interfaces and tools across the different machines.
Fault tolerance requirements in apps and systems software
particularly as systems scale up to petascale around 2010
Support for application users submitting interactive jobs
computational steering as means of scientific discovery
High performance File System and I/O research
increasing demands of security, scalability, and fault tolerance
Security
One-time-passwords and impact on scientific progress
Heterogeneous Machines
Heterogeneous Architectures
Vector architectures, Scalar, SMP, Hybrids, Clusters
How is a science team to know what is best for them?
Multiple OS
Even within one machine, e.g. Blue Gene, Red Storm
How to effectively and efficiently administer such systems?
Diverse programming environment
science teams need to have a robust environment that presents similar
programming interfaces and tools across the different machines
Diverse system management environment
Managing and scheduling multiple node types
System updates, accounting, … everything will be harder in round 2
Fault Tolerance
Holistic Fault Tolerance
Research into schemes that take into account the full impact of faults:
application, middleware, OS, and hardware
Fault tolerance in systems software
• Research into prediction and prevention
• Survivability and resiliency when faults cannot be avoided
Application recovery
• Transparent failure recovery
• Research into intelligent checkpointing based on active monitoring,
sophisticated rule-based recovery, diskless checkpointing, …
• For petascale systems, research into recovery w/o checkpointing
Interactive Computing
Batch jobs are not always the best for science
Good for large numbers of users and a wide mix of jobs, but the
National Leadership Computing Facility has a different focus
Computational Steering as a paradigm for discovery
Break the cycle: simulate, dump results, analyze, rerun simulation
More efficient use of the computer resources
Needed for Application development
Scaling studies on terascale systems
Debugging applications which only fail at scale
File System and I/O Research
Lustre is today’s answer
There are already concerns about its capabilities as systems scale up
to 100+ TF
What is the answer for 2010?
Research is needed to explore the file system and I/O requirements for
petascale systems that will be here in 5 years
I/O continues to be a bottleneck in large systems
Hitting the memory access wall on a node
Too expensive to scale I/O bandwidth with Teraflops across nodes
Research needed to understand how to structure applications or modify
I/O to allow applications to run efficiently
Security
New stricter access policies to computer centers
Attacks on supercomputer centers have gotten worse.
One-Time-Passwords, PIV?
Sites are shifting policies, tightening firewalls, going to SecureID tokens
Impact on scientific progress
Collaborations within international teams
Foreign nationals clearance delays
Access to data and computational resources
Advances required in system software
To allow compliance with different site policies and be able to handle the
tightest requirements
Study how to reduce impact on scientists
Meeting notes
Al Geist – see slides
Craig Steffen – Exciting new race condition!
Nodes go offline – Warehouse doesn't know quickly enough
Event manager, scheduler, lots of components affected
Problem grows linearly with system size
Order of operations needs to be considered – something we haven't
considered before. Issue can be reduced, can't be solved
Good discussion on ways to reduce race conditions.
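A minimal sketch of the kind of stale-state race discussed above, using
hypothetical stand-ins rather than the real Warehouse/Maui interfaces: a
scheduler working from a cached node list can dispatch to a node that has
already gone offline, and re-checking state just before dispatch (an ordering
fix) narrows the window without eliminating it.

```python
# Toy illustration of the stale-node-state race (hypothetical stand-ins,
# not the Warehouse or Maui APIs). The monitor updates node state with a
# lag; the scheduler may dispatch to a node that has already gone down.
import threading
import time

node_state = {"node001": "UP"}   # the monitor's (possibly stale) view
lock = threading.Lock()

def monitor_notices_failure(lag):
    """Simulate the monitoring lag before an offline node is reported DOWN."""
    time.sleep(lag)              # the node is actually down during this window
    with lock:
        node_state["node001"] = "DOWN"

def schedule_job():
    """Pick a node from the cached view; re-check just before dispatch."""
    with lock:
        candidates = [n for n, s in node_state.items() if s == "UP"]
    if not candidates:
        return None
    target = candidates[0]
    # Mitigation by ordering: re-check state immediately before dispatch.
    # This shrinks the race window but cannot close it completely.
    with lock:
        if node_state[target] != "UP":
            return None
    return target

threading.Thread(target=monitor_notices_failure, args=(0.5,)).start()
print("dispatched to:", schedule_job())   # may name a node that is already down
```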
SSS use at NCSA
Paul Egli rewrote Warehouse – many new features added, Sandia user
Now monitoring sessions
All configuration is dynamic
Multiple debugging channels
Sandia user – tested to 1024 virtual nodes
Web site – http://arrakis.ncsa.uiuc.edu/warehouse/
New hire full time on SSS
Lining up T2 scheduling (500 proc)
Meeting notes
Paul Hargrove – Checkpoint Manager BLCR status
AMD64/EM64T port now in beta (crashes some users' machines)
Recently discovered kernel panic during signal interaction
(must fix at hackerfest)
Next step process groups/sessions – begin next week
LRS-XML and Events “real soon now”
Open MPI checkpoint/restart support by SC2005
Torque integration done at U. Mich. for PhD thesis (needs hardening)
Process manager – MPD rewrite “refactoring”
Getting a PM stable and working on BG.
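For reference, a minimal sketch of driving BLCR's standard command-line
utilities (cr_run, cr_checkpoint, cr_restart) from Python. The wrapper
functions and the ./my_app command are hypothetical, and the default
context.<pid> checkpoint filename is an assumption.

```python
# Sketch of wrapping BLCR's command-line tools from Python.
# cr_run, cr_checkpoint, and cr_restart are BLCR's utilities; the wrapper
# functions and the application command here are hypothetical, and the
# context.<pid> default output filename is an assumption.
import subprocess

def start_with_checkpointing(cmd):
    """Launch a program under cr_run so it can be checkpointed later."""
    return subprocess.Popen(["cr_run"] + cmd)

def checkpoint(pid):
    """Checkpoint a running process; BLCR writes a context file for it."""
    subprocess.run(["cr_checkpoint", str(pid)], check=True)
    return "context.%d" % pid        # assumed default output name

def restart(context_file):
    """Restart a previously checkpointed process from its context file."""
    return subprocess.Popen(["cr_restart", context_file])

if __name__ == "__main__":
    proc = start_with_checkpointing(["./my_app"])   # hypothetical application
    # ... later, e.g. before a scheduled shutdown ...
    ctx = checkpoint(proc.pid)       # the application keeps running afterwards
    # If the job is killed or the node rebooted, it can be resumed with:
    # restart(ctx)
```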
Todd K – Scalability and Robustness tests
ESP2 Efficiency ratio 0.9173 on 64 nodes
Scalability – Bamboo 1000 job submission
Gold (Java version) – reservation slow – Perl version not tested
Warehouse – up to 1024 nodes
Maui on 64 nodes (need more testing)
Durability – Node Warm stop – 30 seconds to Maui notification
Node Warm start – 10 seconds
Node Cold stop – 30 seconds
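A minimal sketch of how a stop-to-notification delay like the ones above might
be measured, assuming a hypothetical query_node_state() helper in place of the
real Maui/Warehouse queries.

```python
# Sketch of timing how long the scheduler takes to notice a stopped node.
# query_node_state() is a hypothetical stand-in for the real query
# (e.g. asking Maui or the monitor for the node's current state).
import time

def query_node_state(node):
    """Hypothetical: return the scheduler's current view of a node's state."""
    raise NotImplementedError("replace with a real state query")

def time_until_marked_down(node, poll_interval=1.0, timeout=300.0):
    """Poll after stopping a node; return seconds until it is reported DOWN."""
    start = time.time()
    while time.time() - start < timeout:
        if query_node_state(node) == "DOWN":
            return time.time() - start
        time.sleep(poll_interval)
    return None    # not detected within the timeout
```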
Meeting notes
Todd K – testing continued
Single node failure – good
Resource hog (stress)
Resource exhaustion – service node (Gold fails in logging package)
Anomalies
Maui
Warehouse
Gold
happynsm
ToDo
Test BLCR module
Retest on larger cluster
Get latest release of all software and retest
Write report on results.
Meeting notes
Brett Bode – RM status
New release of components
Bamboo v1.1
Maui 3.2.6p13
Gold 2b2.10.2
Gold being used on Utah cluster
SSS suite on several systems at Ames
New fountain component – to front end Supermon, ganglia, etc.
Demos new tool called Goanna for looking at fountain output
Has same interface as Warehouse – could plug right in
General release of Gold 2.0.0.0 available. New Perl CGI GUI
no Java dependency at all in Gold
X509 support in Mcom (for Maui and Silver)
Cluster scheduler – bunch of new features
Grid scheduler – enabled basic accounting for grid jobs.
Future work – Gary needs to get up to speed on Gold code
make it all work with LRS
Meeting notes
Narayan – LRS conversion status
All components in center cloud converted to LRS
Service Directory, Event Manager, BCM stack, Process Manager
Targeted for SC05 release
SSSlib changeover – completed
SDK support – completed
Cobalt Overview
SSS suite on Chiba and BG
Motivations – scalability, flexibility, simplicity, support for research ideas
Tools included: parallel programming tools
Porting has been easy – now running on Linux, MacOS, and BG/L
Only about 5K lines of code.
Targeted for Cray XT3, X1, ZeptoOS
Unique features – small partition support on BG/L, OS Spec support
Agile – swap out components. User and admin requests easier to satisfy
Running on ANL and NCAR (evaluation at other BG sites)
May be running on JAZZ soon.
Future- better scheduler, new platforms, more front ends, better docs
Meeting notes
Narayan – Parallel tool development
- Parallel Unix tools suite
- File staging
- Parallel rsync