FAST-OS BOF SC 04
http://www.cs.unm.edu/~fastos
Follow link to subscribe to the mail list.
Projects
• Colony – Terry Jones, LLNL
• Config Framework – Ron Brightwell, SNL
• DAiSES – Pat Teller, UTEP
• K42 – Paul Hargrove, LBNL
• MOLAR – Stephen Scott, ORNL
• Peta-Scale SSI – Scott Studham, ORNL
• Right-Weight Kernels – Ron Minnich, LANL
• Scalable FT – Jarek Nieplocha, PNNL
• SmartApps – Lawrence Rauchwerger, Texas A&M
• ZeptoOS – Pete Beckman, ANL
www.HPC-Colony.org
Services & Interfaces for Very Large Linux Clusters
Terry Jones, LLNL, Coordinating PI
Laxmikant Kale, UIUC, PI
Jose Moreira, IBM, PI
Celso Mendes, UIUC
Derek Lieber, IBM
Colony
Overview
Title: Services and Interfaces to Support Systems with Very Large Numbers of Processors
Collaborators: Lawrence Livermore National Laboratory, University of Illinois at Urbana-Champaign, International Business Machines
Topics:
• Parallel Resource Instrumentation Framework
• Scalable Load Balancing
• OS mechanisms for Migration
• Processor Virtualization for Fault Tolerance
• Single system management space
• Parallel Awareness and Coordinated Scheduling of Services
• Linux OS for cellular architecture
Colony
Motivation
• Parallel resource management
Strategies for scheduling and load balancing must be
improved. Difficulties in achieving a balanced partitioning
and dynamically scheduling workloads can limit scaling for
complex problems on large machines.
• Global system management
System management is inadequate. Parallel jobs require
common operating system services, such as process
scheduling, event notification, and job management to
scale to large machines.
Colony
Goals
• Develop infrastructure and strategies for automated parallel resource
management
– Today, application programmers must explicitly manage these resources.
We address scaling issues and porting issues by delegating resource
management tasks to a sophisticated parallel OS.
– “Managing Resources” includes balancing CPU time, network utilization,
and memory usage across the entire machine.
• Develop a set of services to enhance the OS to improve its ability to
support systems with very large numbers of processors
– We will improve operating system awareness of the requirements of parallel
applications.
– We will enhance operating system support for parallel execution by
providing coordinated scheduling and improved management services for
very large machines.
Colony
Approach
• Top Down
– Our work will start from an existing full-featured OS and remove excess
baggage with a “top down” approach.
• Processor virtualization
– One of our core techniques: the programmer divides the computation into a
large number of entities, which are mapped to the available processors by an
intelligent runtime system (see the sketch after this list).
• Leverage Advantages of Full Featured OS & Single System Image
– Applications on these extreme-scale systems will benefit from extensive
services and interfaces; managing these complex systems will require an
improved “logical view”
• Utilize Blue Gene
– Suitable platform for ideas intended for very large numbers
of processors
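
The over-decomposition idea behind processor virtualization can be pictured with a small sketch in C. This is a toy illustration, not Colony's runtime: the chunk count, the measured costs, and the greedy rebalancing heuristic are all assumptions made for the example.

/* Sketch of processor virtualization / over-decomposition: the
 * computation is split into many more chunks than processors, and a
 * runtime layer remaps chunks to processors based on measured load.
 * Chunk counts, costs, and the greedy heuristic are illustrative
 * assumptions, not the Colony implementation. */
#include <stdio.h>

#define NPROCS  8      /* physical processors         */
#define NCHUNKS 64     /* virtual processors (chunks) */

int main(void)
{
    double cost[NCHUNKS];        /* measured per-chunk execution time */
    double load[NPROCS] = {0};
    int    owner[NCHUNKS];

    /* Pretend we measured uneven chunk costs during the last phase. */
    for (int c = 0; c < NCHUNKS; c++)
        cost[c] = 1.0 + (c % 7) * 0.3;

    /* Greedy rebalance: give each chunk to the least-loaded processor. */
    for (int c = 0; c < NCHUNKS; c++) {
        int best = 0;
        for (int p = 1; p < NPROCS; p++)
            if (load[p] < load[best])
                best = p;
        owner[c]    = best;
        load[best] += cost[c];
    }

    for (int p = 0; p < NPROCS; p++)
        printf("proc %d load %.1f\n", p, load[p]);
    (void)owner;
    return 0;
}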
Configurable OS Framework
• Sandia, lead
– Ron Brightwell, PI
– Rolf Riesen
• Caltech
– Thomas Sterling, PI
• UNM
– Barney Maccabe, PI
– Patrick Bridges
Issues
• Novel architectures
– Lots of execution environments
• Programming models
– MPI, UPC, separating processing from location
• Shared services
– File systems, shared WAN
• Usage model
– Dedicated, space shared, time shared
Approach
• Build application specific OS
– Architecture, programming model, shared
resources, usage model
• Develop a collection of Micro services
– Compose and distribute
• Compose services (see the sketch after this list)
– Services may adapt
• Kinds of services
– Memory allocation, signal delivery, message
receipt and handler activation
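
As a rough illustration of composing micro services into an application-specific OS image (the service_t interface and the two example services below are invented for this sketch, not the framework's actual API):

/* Minimal sketch of composing micro services.  The interface and the
 * static service table are illustrative assumptions only. */
#include <stdio.h>

typedef struct {
    const char *name;
    int (*init)(void);            /* bring the service up       */
    int (*handle)(int event);     /* react to an event/message  */
} service_t;

static int mem_init(void)    { puts("memory allocator up"); return 0; }
static int mem_handle(int e) { (void)e; return 0; }
static int sig_init(void)    { puts("signal delivery up");  return 0; }
static int sig_handle(int e) { printf("signal %d delivered\n", e); return 0; }

/* The "composition" step: pick exactly the services this app needs. */
static service_t services[] = {
    { "memory_alloc",   mem_init, mem_handle },
    { "signal_deliver", sig_init, sig_handle },
};

int main(void)
{
    int n = (int)(sizeof services / sizeof services[0]);
    for (int i = 0; i < n; i++)
        services[i].init();
    for (int i = 0; i < n; i++)   /* dispatch one event to each service */
        services[i].handle(42);
    return 0;
}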
The Picture
Challenges
• How to reason about combinations
• Dependencies among services
• Efficiency
– Overhead associated with transfers
between micro services
• How many operating systems will we
really need?
Goals
Dynamic Adaptability in Support of Extreme Scale
[Diagram: moving OS/runtime services and resource management from fixed and generalized to dynamically adaptable and customized, for enhanced performance.]
Challenges
Dynamic Adaptability in Support of Extreme Scale
Determining
• What to adapt
• When to adapt
• How to adapt
• How to measure effects of adaptation
Deliverables
Dynamic Adaptability in Support of Extreme Scale
• Develop mechanisms to dynamically sense,
analyze, and adjust common performance
metrics, fluctuating workload situations, and
overall system environment conditions
• Demonstrate, via Linux prototypes and
experiments, dynamic self-tuning/provisioning
in HPC environments
• Develop a methodology for general-purpose
OS adaptation
Methodology
Dynamic Adaptability in Support of Extreme Scale
• Identify adaptation targets: characterize workload resource usage patterns and potential adaptation targets (off line)
• (Re)determine adaptation intervals (off line / run time)
• Define/adapt heuristics to trigger adaptation
• Generate/adapt monitoring, triggering, and adaptation code, and attach it to the OS (via KernInst)
• Monitor application execution, triggering adaptation as necessary
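
At run time this boils down to a monitor/trigger/adapt loop. The sketch below is a user-level stand-in with an invented metric, threshold, and adaptation target; DAiSES attaches the real monitoring and adaptation code to the kernel with KernInst.

/* Sketch of the run-time monitor -> trigger -> adapt cycle.  The
 * metric, threshold, and "adaptation" below are placeholders. */
#include <stdio.h>
#include <unistd.h>

static double read_metric(int step)     /* stand-in for a kernel counter */
{
    return 50.0 + (step % 10) * 10.0;   /* pretend load oscillates */
}

static void adapt(double *buffer_kb)    /* stand-in adaptation target */
{
    *buffer_kb *= 2.0;                  /* e.g. grow an OS buffer */
    printf("adapted: buffer now %.0f KB\n", *buffer_kb);
}

int main(void)
{
    double threshold = 120.0, buffer_kb = 64.0;
    for (int step = 0; step < 20; step++) {     /* adaptation interval */
        double m = read_metric(step);
        if (m > threshold)                      /* trigger heuristic   */
            adapt(&buffer_kb);
        usleep(1000);                           /* next interval       */
    }
    return 0;
}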
KernInst
Dynamic Adaptability in Support of Extreme Scale
Dynamic instrumentation of the kernel
[Diagram: KernInst on an IBM pSeries eServer 690: instrumentation tool (client), KernInst API, KernInst daemon, KernInst device, Linux kernel.]
• KernInst and Kperfmon provide the capability to perform dynamic
monitoring and adaptation of commodity operating systems.
• The University of Wisconsin’s KernInst and Kperfmon make the problem
of run-time monitoring and adaptation more tractable.
Example Adaptations
Dynamic Adaptability in Support of Extreme Scale
Customization of
• process scheduling parameters and algorithms, e.g., scheduling
policy for different job types (prototype in progress)
• file system cache size and management
• disk cache management
• size of OS buffers and tables
• I/O, e.g., checkpoint/restart
• memory allocation and management parameters and
algorithms
Partners
Dynamic Adaptability in Support of Extreme Scale
University of Texas at El Paso
Department of Computer Science
Patricia J. Teller ([email protected])
University of Wisconsin — Madison
Computer Sciences Department
Barton P. Miller ([email protected])
International Business Machines, Inc.
Linux Technology Center
Bill Buros ([email protected])
U.S. Department of Energy
Office of Science
Fred Johnson ([email protected])
High End Computing with K42
Paul H. Hargrove and
Katherine Yelick
Lawrence Berkeley National Lab
Angela Demke Brown and
Michael Stumm
University of Toronto
Patrick Bridges
University of New Mexico
Orran Krieger and
Dilma Da Silva
IBM
K42
Project Motivation
• The HECRTF and FastOS reports enumerate unmet
needs in the area of Operating Systems for HEC,
including
– Availability of Research Frameworks
– Support for Architectural Innovation
– Performance Visibility
– Ease of Use
– Adaptability to Application Requirements
• This project uses the K42 Operating System to
address these five needs
K42
K42 Background
• K42 is a research OS from IBM
– API/ABI compatibility with Linux
– Designed for large 64-bit SMPs
– Extensible object-oriented design
• Features per-resource-instance objects
• Can change implementation/policy for individual instances at
runtime
– Extensive performance-monitoring
– Many traditional OS functions are performed in user-space
libraries
K42
What Work Remains? (1 of 2)
• Availability of Research Frameworks & Support for Architectural
Innovation
– K42 is already a research platform, used by IBM for their PERCS
project (DARPA HPCS) to support architectural innovation
– Work remains to expand K42 from SMPs to clusters
• Performance Visibility
– Existing facilities are quite extensive
– Work remains to use runtime replacement of object
implementations to monitor single objects for fine-grained control
K42
What Work Remains? (2 of 2)
• Ease of Use
– Work remains to make K42 widely available, and to bring
HEC user environments to K42 (e.g. MPI, batch systems,
etc.)
• Adaptability to Application Requirements
– Runtime replacement of object implementations provides
extreme customizability
– Work remains to provide implementations appropriate to
HEC, and to perform automatic dynamic adaptation
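
The idea of swapping an individual object's implementation at run time can be sketched with a per-instance function-pointer table, as below. This is only the gist: K42's hot-swap mechanism additionally quiesces the object and transfers its state, none of which is shown, and the names here are invented.

/* Idea sketch: per-instance policy objects whose implementation can be
 * swapped at run time through an indirection table.  A toy
 * illustration, not K42's real mechanism. */
#include <stdio.h>

typedef struct page_cache page_cache;
typedef struct {
    void (*insert)(page_cache *, int page);
} cache_ops;

struct page_cache {
    const cache_ops *ops;   /* per-instance implementation pointer */
    int last;
};

static void lru_insert(page_cache *c, int p) { c->last = p; puts("LRU insert"); }
static void hec_insert(page_cache *c, int p) { c->last = p; puts("HEC (no-cache) insert"); }

static const cache_ops lru_impl = { lru_insert };
static const cache_ops hec_impl = { hec_insert };

int main(void)
{
    page_cache c = { &lru_impl, -1 };
    c.ops->insert(&c, 1);     /* default policy                     */
    c.ops = &hec_impl;        /* "hot swap" this one instance only  */
    c.ops->insert(&c, 2);
    return 0;
}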
MOLAR: Modular Linux and Adaptive Runtime Support
for High-end Computing Operating and Runtime
Systems
Coordinating Principal Investigator
Stephen L. Scott, ORNL
[email protected]
Principal Investigators
J. Vetter, D.E. Bernholdt, C. Engelmann – ORNL
C. Leangsuksun – Louisiana Tech University
P. Sadayappan – Ohio State University
F. Mueller – North Carolina State University
Collaborators
A.B. Maccabe – University of New Mexico
C. Nuss, D. Mason – Cray Inc.
OAK RIDGE NATIONAL LABORATORY
U. S. DEPARTMENT OF ENERGY
MOLAR
MOLAR research goals
• Create a modular and configurable Linux system that allows customized
changes based on the requirements of the applications, runtime systems, and
cluster management software.
• Build runtime systems that leverage the OS modularity and configurability to
improve efficiency, reliability, scalability, and ease-of-use, and to provide support
for legacy and promising programming models.
• Advance computer reliability, availability and serviceability (RAS) management
systems to work cooperatively with the OS/R to identify and preemptively
resolve system issues.
• Explore the use of advanced monitoring and adaptation to improve application
performance and predictability of system interruptions.
MOLAR
MOLAR: Modular Linux and Adaptive Runtime support
High-end Computing OS Research Map
[Diagram: research areas arranged around a modular, custom, light-weight HEC Linux OS: kernel design; performance observation; communications and I/O; monitoring; extending/adapting the runtime/OS; root cause analysis; RAS; high availability; testbeds provided.]
PROBLEM:
• Current OSs and runtime systems (OS/R) are unable to meet the various requirements to run large
applications efficiently on future ultra-scale computers.
GOALS:
• Development of a modular and configurable Linux framework.
• Runtime systems to provide a seamless coordination between system levels.
• Monitoring and adaptation of the operating system, runtime, and applications.
• Reliability, availability, and serviceability (RAS)
• Efficient system management tools.
IMPACT:
• Enhanced support and better understanding of extremely scalable architectures.
• Proof-of-concept implementation open to community researchers.
MOLAR
MOLAR crosscut capability deployed for RAS
• Monitoring Core Daemon
• service monitor
• resource monitor
• hardware health monitor
• Head nodes: active / hot standby
• Services: active / hot standby
• Modular Linux systems
deployment & development
MOLAR
MOLAR Federated System Management (fSM)
• fSM emphasizes simplicity
• self-build
• self-configuration
• self-healing
• simplified operation
• Expand MOLAR support:
• Investigate specialized
architectures
• Investigate other
environments & OSs
• Head nodes: active / active
• Services: active / active
Peta-Scale Single-System Image
A framework for a single-system image Linux environment
for 100,000+ processors and multiple architectures
Coordinating Investigator
R. Scott Studham, ORNL
Principal Investigators
Alan Cox, Rice University
Bruce Walker, HP
Investigators
Peter Druschel, Rice University
Scott Rixner, Rice University
Collaborators
Peter Braam, CFS
Steve Reinhardt, SGI
Stephen Wheat, Intel
Peta-Scale SSI
Project Key Objectives
• OpenSSI to 10,000 nodes
• Integration of OpenSSI with nodes with high processor counts
• The scalability of a shared root filesystem to 10,000 nodes
• Scalable booting and monitoring mechanisms
• Research enhancements to OpenSSI’s P2P communications
• The use of very large page sizes (superpages) for large address spaces
• Determine the proper interconnect balance as it impacts the operating
system (OS)
• Establish system-wide tools and process management for a 100,000
processor environment
• OS noise (services that interrupt computation) effects
• Integrating a job scheduler with the OS
• Preemptive task migration
Peta-Scale SSI
Reduce OS noise and increase cluster scalability via efficient compute nodes
[Diagram: OpenSSI software stacks. Service nodes carry the full stack (install and sysadmin, boot and init, devices, IPC, application monitoring and restart, HA resource management and job scheduling, MPI, process load leveling, CLMS, cluster filesystem (CFS), DLM, Vproc, Lustre client, remote file block, ICS); compute nodes carry a reduced stack (CLMS Lite, LVS, Vproc, Lustre client, remote file block, ICS).]
Service Nodes:
• single install; local boot (for HA); single IP (LVS)
• connection load balancing (LVS)
• single root with HA (Lustre); single file system namespace (Lustre)
• single IPC namespace
• single process space and process load leveling
• application HA
• strong/strict membership
Compute Nodes:
• single install; network or local boot
• not part of single IP and no connection load balancing
• single root with caching (Lustre); single file system namespace (Lustre)
• no single IPC namespace (optional)
• single process space but no process load leveling
• no HA participation
• scalable (relaxed) membership
• inter-node communication channels on demand only
Peta-Scale SSI
Researching the intersection of SSI
and large kernels to get to 100,000+ processors
[Chart: x-axis runs from a stock Linux kernel on 1 node to software SSI clusters of 10,000 nodes; y-axis runs from 1 CPU to a single Linux kernel on 2048 CPUs.
1) Establish scalability baselines: continue SGI’s work on single-kernel scalability and continue OpenSSI’s work on SSI scalability in typical environments.
2) Enhance the scalability of both approaches.
3) Understand the intersection of both methods: test large kernels combined with software OpenSSI to establish the sweet spot for a 100,000-processor Linux.]
Right-Weight Kernels
The right kernel, in
the right place, at
the right time
RWK
OS effect on Parallel
Applications
• Simple problem: if all processors save one
arrive at a join, then all wait for the laggard
[Mraz SC ’94]
– Mraz resolved the problem for AIX, interestingly,
with purely local scheduling decisions (i.e., no
global scheduler)
– Sandia resolved it by getting rid of the OS entirely
(i.e., creation of the “Light-Weight Kernel”)
• AIX has more capability than many apps
need
• LWK has less capability than many apps want
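
The laggard effect is easy to observe with a few lines of MPI: every rank does identical work and then hits a barrier, and the spread in barrier wait times shows how one interrupted rank delays everyone. Iteration counts below are arbitrary.

/* Minimal demonstration of the "laggard" effect: every rank does the
 * same fixed work, then waits at a barrier; OS interference on any one
 * rank stretches everyone's barrier time. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double worst = 0.0;
    for (int iter = 0; iter < 1000; iter++) {
        volatile double x = 0.0;
        for (int i = 0; i < 100000; i++)   /* identical work on all ranks */
            x += i * 0.5;
        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);       /* the "join"                  */
        double wait = MPI_Wtime() - t0;
        if (wait > worst) worst = wait;
    }
    printf("rank %d: worst barrier wait %.6f s\n", rank, worst);
    MPI_Finalize();
    return 0;
}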
RWK
Hence Right-Weight Kernels
• Customize the kernel to the app
• We’re looking at two different approaches
• Customized, Modular Linux
– Based on 2.6
– With some scheduling enhancements
• “COTS” Secure LWK
– Based, after some searching, on Plan 9
– With some performance enhancements
RWK
Balancing Capability and Overhead
[Diagram: a spectrum from full-featured OSes (AIX, Tru64, Solaris, Linux, etc.) down to no OS at all; per-node capability increases toward the full-OS end, while OS impact on the application decreases toward the no-OS end. Right-Weight Kernels sit between the two extremes.]
• We need to balance the capabilities that a full OS
gives the user with the overhead of providing such
services
• For a given app, we want to be as close to the
“optimal” balance as possible
• But how do we measure what that is?
RWK
Measuring what is “good”
• OS activity is periodic, thus we need to use
techniques such as time series analysis to
evaluate the measured data
– Use this data to figure out what is “good” and
“bad”
• Caveat: you must practice good sampling
hygiene [Sottile & Minnich, Cluster ’04]
– Must follow rules of statistical sampling
– Measuring work per unit of time leads to
statistically sound data
– Measuring time per unit of work leads to
meaningless data
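
A minimal sketch of the "work per unit of time" approach (in the spirit of a fixed-time-quantum benchmark; the quantum length and sample count are arbitrary choices, not the published methodology): count how much work completes in each fixed interval and analyze that series offline.

/* Count loop iterations completed in each fixed time quantum.  Dips in
 * the resulting series show when the OS took the CPU away. */
#include <stdio.h>
#include <time.h>

#define SAMPLES 1000

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const double quantum = 0.001;          /* 1 ms sample interval */
    long work[SAMPLES];

    for (int s = 0; s < SAMPLES; s++) {
        long count = 0;
        double end = now() + quantum;
        while (now() < end)                /* do trivial work until the */
            count++;                       /* quantum expires           */
        work[s] = count;
    }
    for (int s = 0; s < SAMPLES; s++)      /* series for offline analysis */
        printf("%d %ld\n", s, work[s]);
    return 0;
}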
RWK
Conclusions
• Use sound statistical measurement
techniques to figure out what is “good”
• Configure compute nodes on a per app basis
(Right-Weight Kernel)
• Rinse and repeat!
• Collaborators
– Sung-Eun Choi, Matt Sottile, Erik Hendriks (LANL)
– Eric Grosse, Jim McKie, Vic Zandy (Bell Labs)
SFT: Scalable Fault Tolerant Runtime
and Operating Systems
Pacific Northwest National Laboratory
Los Alamos National Laboratory
University of Illinois
Quadrics
SFT
Team
• Jarek Nieplocha, PNNL
• Fabrizio Petrini and Kei Davis (LANL)
• Josep Torrellas and Yuanyuan Zhou
(UIUC)
• David Addison (Quadrics)
• Industrial Partner: Stephen Wheat (Intel)
SFT
Motivation
• With the massive number of components comprising the
forthcoming petascale computer systems, hardware failures will
be routinely encountered during execution of large-scale
applications.
• Application drivers
– The multidisciplinary, multiresolution, and multiscale nature of scientific
problems drives the demand for high-end systems
– Applications place increasingly differing demands on system
resources: disk, network, memory, and CPU
• Therefore, it will not be cost-effective or practical to rely on a
single fault tolerance approach for all applications.
SFT
Goals
• Develop scalable and practical
techniques for addressing fault
tolerance at the Operating System and
Runtime levels
– Design based on requirements of DoE
applications
– Minimal impact on application performance
SFT
Petaflop Architecture
[Diagram: processors connected through an interconnection network to memories; nodes are tightly coupled internally, and memory is globally addressable but non-coherent between nodes.]
SFT
Scope
• We will investigate, develop, and evaluate a
comprehensive range of techniques for fault
tolerance.
– System level incremental checkpointing approach (see the sketch after this list)
• based on Buffered CoScheduling
• temporal and spatial hybrid checkpointing
• in-memory checkpointing and efficient handling of I/O
– Fault awareness in communication libraries
• while exploiting high performance network communication
• MPI, ARMCI
• scalability
– Feasibility analysis of incremental checkpointing
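
One standard way to realize incremental checkpointing, sketched below, is write-protection-based dirty-page tracking: after a checkpoint the address space is write-protected, the first write to a page faults and marks it dirty, and only dirty pages need saving next time. Sizes and the flat bitmap are illustrative; the SFT design (e.g. its Buffered CoScheduling integration) is not reproduced here.

/* Dirty-page tracking sketch for incremental checkpointing. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 16
static char *region;
static long  pagesz;
static int   dirty[NPAGES];

static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    long page = ((char *)si->si_addr - region) / pagesz;
    dirty[page] = 1;                               /* remember it     */
    mprotect(region + page * pagesz, pagesz,       /* allow the write */
             PROT_READ | PROT_WRITE);
}

static void checkpoint(void)
{
    int n = 0;
    for (int p = 0; p < NPAGES; p++) if (dirty[p]) n++;
    printf("checkpoint: %d of %d pages dirty\n", n, NPAGES);
    memset(dirty, 0, sizeof dirty);                /* start a new epoch */
    mprotect(region, NPAGES * pagesz, PROT_READ);  /* re-protect all    */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    pagesz = sysconf(_SC_PAGESIZE);
    region = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mprotect(region, NPAGES * pagesz, PROT_READ);  /* epoch 0          */

    region[0] = 1;                                 /* dirties page 0   */
    region[5 * pagesz] = 1;                        /* dirties page 5   */
    checkpoint();                                  /* reports 2 dirty  */
    return 0;
}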
SFT
Buffered CoScheduling
SmartApps:
Middleware for Adaptive
Applications on
Reconfigurable Platforms
Lawrence Rauchwerger
http://parasol.tamu.edu/~rwerger/
Parasol Lab, Dept of Computer Science, Texas A&M
SmartApps
Today: System-Centric Computing
Classic avenues to performance:
• Parallel Algorithms
• Static Compiler Optimization
• OS support
• Good Architecture
[Diagram: the application (algorithm) passes through development, analysis & optimization by a static compiler, then runs on an execution system (OS & architecture) with its input data.]
WHAT’S MISSING?
• Compilers are conservative: no global optimization
• OS offers generic services
• No matching between Application/OS/HW
• Architecture is generic
• Intractable for the general case
SmartApps
Our Approach: SmartApps
Application-Centric Computing
[Diagram: the application (algorithm) is developed, analyzed, and optimized by a static compiler augmented with run-time techniques; the resulting SmartApp bundles a run-time system for execution, analysis & optimization, a run-time compiler, a modular OS, and a reconfigurable architecture, all driven by the input data.]
Application Control: instance-specific optimization
Compiler + OS + Architecture + Data + Feedback
SmartApps
SmartApps Architecture
[Flowchart: a STAPL application is compiled by the static STAPL compiler augmented with run-time techniques (development stage), producing compiled code plus runtime hooks: the Smart Application. At run time (advanced stages) it gets runtime information (sample input, system information, etc.) and computes an optimal application and RTS + OS configuration using a toolbox of predictors, evaluators, and optimizers backed by an architecture database, then executes while continuously monitoring performance and adapting as necessary. Small adaptations are runtime tuning without recompiling; large adaptations (failure, phase change) recompute the application configuration and/or reconfigure the RTS + OS through adaptive software and an adaptive RTS + OS.]
SmartApps
SmartApps written in STAPL
• STAPL (Standard Template Adaptive Parallel Library):
– Collection of generic parallel algorithms, distributed
containers & run-time system (RTS)
– Inter-operable with Sequential Programs
– Extensible, Composable by end-user
– Shared Object View: No explicit communication
– Distributed Objects: no replication/coherence
– High Productivity Environment
SmartApps
The STAPL Programming Environment
[Layer diagram, top to bottom: User Code; pAlgorithms; pContainers; pRange; RTS + Communication Library (ARMI); Interface to OS (K42); OpenMP/MPI/pthreads/native.]
SmartApps
SmartApps to RTS to OS
Specialized Services from Generic OS Services
– The OS offers one-size-fits-all services
– IBM’s K42 offers customizable services
– We want customized services, but we do not want to write them ourselves
Interface between SmartApps (RTS) & OS (K42)
• Vertical integration of Scheduling/Memory Management
SmartApps
Collaborative Effort:
• STAPL (Amato/Rauchwerger)
• STAPL Compiler
(Rauchwerger/Stroustrup/Quinlan)
• RTS – K42 Interface & Optimizations
(Krieger/Rauchwerger)
• Applications (Amato/Adams/ others)
• Validation on DOE extreme HW
BlueGene (Moreira) , possibly PERCS
(Krieger/Sarkar)
Texas A&M (Parasol, NE) + IBM + LLNL
ZeptoOS
Studying Petascale Operating Systems
with Linux
Argonne National Laboratory
Pete Beckman
Bill Gropp
Rusty Lusk
Susan Coghlan
Suravee Suthikulpanit
University of Oregon
Al Malony
Sameer Shende
ZeptoOS
Observations:
• Extremely large systems run an
“OS Suite”
– BG/L and Red Storm both have at least
4 different operating system flavors
• Functional Decomposition trend
lends itself toward a customized,
optimized point-solution OS
• Hierarchical Organization
requires software to manage
topology, call forwarding, and
collective operations
ZeptoOS
ZeptoOS
• Investigating 4 key areas:
– Linux as an ultra-lightweight kernel
• Memory mgmt, scheduling efficiency, network
– Collective OS calls
• Explicit collective behavior may be key (DLLs?)
– OS Performance monitoring for
hierarchical systems
– Fault tolerance
ZeptoOS
Linux as a Lightweight Kernel
What does an OS steal from a selfish CPU application?
• Purpose: a micro-benchmark measuring the CPU cycles provided to the
benchmark application
• Helps understand the “MPI-reduce problem” and gang scheduling issues
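
The "selfish" measurement can be approximated in user space by timestamping a tight loop and recording every gap noticeably larger than a normal iteration; those gaps are time the OS took away. The threshold and run length below are assumptions, not the ZeptoOS benchmark itself.

/* Spin reading the clock and record any gap much larger than a normal
 * iteration -- those gaps are cycles the OS took from the application. */
#include <stdio.h>
#include <time.h>

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    const long long threshold_ns = 5000;   /* "bigger than one iteration" */
    long long prev = now_ns(), stolen = 0, start = prev;
    long interruptions = 0;

    while (now_ns() - start < 2000000000LL) {   /* run for ~2 seconds  */
        long long t = now_ns();
        if (t - prev > threshold_ns) {          /* detour: OS ran here */
            stolen += t - prev;
            interruptions++;
        }
        prev = t;
    }
    printf("%ld interruptions, ~%.3f ms stolen\n",
           interruptions, stolen / 1e6);
    return 0;
}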
ZeptoOS
Collective OS Calls
• Collective message-passing calls have been very efficiently
implemented on many architectures
• Collective I/O calls permit scalable,
efficient (non-Posix) file I/O
• Collective OS calls, such as dynamically
loading libraries, may provide scalable
OS functionality
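
As an illustration of why a collective library load helps at scale, one rank can read the library once and broadcast the bytes so that tens of thousands of processes do not hit the file system simultaneously. The sketch below is user-level MPI with a hypothetical library path and simplified error handling, not the ZeptoOS mechanism.

/* "Collective OS call" illustration: rank 0 reads a shared library
 * once, broadcasts it, and every rank dlopen()s a local copy. */
#include <mpi.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const char *src = "/shared/libfoo.so";   /* hypothetical library */
    long size = 0;
    char *buf = NULL;

    if (rank == 0) {                          /* only one reader      */
        FILE *f = fopen(src, "rb");
        fseek(f, 0, SEEK_END); size = ftell(f); rewind(f);
        buf = malloc(size);
        fread(buf, 1, size, f);
        fclose(f);
    }
    MPI_Bcast(&size, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0) buf = malloc(size);
    MPI_Bcast(buf, (int)size, MPI_BYTE, 0, MPI_COMM_WORLD);

    char local[64];                           /* node-local copy      */
    snprintf(local, sizeof local, "/tmp/libfoo.%d.so", rank);
    FILE *out = fopen(local, "wb");
    fwrite(buf, 1, size, out);
    fclose(out);

    void *h = dlopen(local, RTLD_NOW);        /* the actual "OS call" */
    printf("rank %d: dlopen %s\n", rank, h ? "ok" : dlerror());
    MPI_Finalize();
    return 0;
}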
ZeptoOS
Scalable OS Performance Monitoring
(U of Oregon)
• TAU provides a framework for scalable
performance analysis
• Integration of TAU into hierarchical systems,
such as BG/L, will allow us to explore:
– Instrumentation of light-weight kernels
• Call forwarding, memory, etc
– Intermediate, parallel aggregation of performance
data at I/O nodes
– Integration of data from the OS Suite
ZeptoOS
Exploring Faults: Faulty Towers
[Diagram: “Dial-a-Disaster” fault injection spanning the memory, MPI/net, kernel, disk, and middleware layers.]
• Modify Linux so we can selectively and predictably break things
• Run user code, middleware, etc. at ultra scale, with faults
• Explore metrics for codes with good “survivability”
It’s not a bug, it’s a feature!
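
A user-level caricature of the "Dial-a-Disaster" idea: wrap an operation so that it fails with a configurable probability, then exercise codes and middleware under those faults. The dial value, failure mode, and function names below are invented for illustration; the project injects faults inside Linux itself.

/* Toy fault injector: fail sends with a configurable probability. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

static double fault_dial = 0.01;     /* 1% of sends "break"            */

static int faulty_send(const void *buf, int len)
{
    (void)buf;
    if ((double)rand() / RAND_MAX < fault_dial) {
        errno = EIO;
        return -1;                   /* pretend the link dropped it    */
    }
    return len;                      /* pretend the send succeeded     */
}

int main(void)
{
    srand(42);                       /* reproducible disasters         */
    int failures = 0;
    for (int i = 0; i < 10000; i++)
        if (faulty_send("payload", 8) < 0)
            failures++;
    printf("%d of 10000 sends failed\n", failures);
    return 0;
}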
Simple Counts
• OSes (4): Linux (6.5), K-42 (2), Custom (1),
Plan 9 (.5)
• Labs (7): ANL, LANL, ORNL, LBNL, LLNL,
PNNL, SNL
• Universities: Caltech, Louisiana Tech, NCSU,
Rice, Ohio State, Texas A&M, Toronto, UIUC,
UTEP, UNM, U of Chicago, U of Oregon, U of
Wisconsin
• Industry: Bell Labs, Cray, HP, IBM, Intel, CFS
(Lustre), Quadrics, SGI
Apple Pie
• Open source
• Partnerships: Labs, universities, and industry
• Scope: basic research, applied research,
development, prototypes, testbed systems,
and deployment
• Structure: “don’t choose a winner too early”
– Current or near-term problems -- commonly used,
open-source OSes (e.g., Linux or FreeBSD)
– Prototyping work in K42 and Plan 9
– At least one wacko project (explore novel ideas
that don’t fit into an existing framework)
A bit more interesting
• Virtualization
– Colony
• Adaptability
– DAiSES, K42, MOLAR, SmartApps
– Config, RWK
• Usage model & system mgmt (OS Suites)
– Colony, Config, MOLAR, Peta-scale SSI, Zepto
• Metrics & Measurement
– HPC Challenge (http://icl.cs.utk.edu/hpcc/)
– DAiSES, K42, MOLAR, RWK, Zepto
• Fault handling
– Colony, MOLAR, Scalable FT, Zepto
continued
• Managing the memory hierarchy
• Security
• Common API
– K42, Linux
• Single System Image
– Peta-scale SSI
• Collective Runtime
– Zepto
• I/O
– Peta-scale SSI
• OS Noise
– Colony, Peta-scale SSI, RWK, Zepto
Application Driven
• Meet the application developers
– OS presentations
– Apps people panic -- what are you doing to
my machine?
– OS people tell ‘em what we heard
– Apps people tell us what we didn’t
understand