Efforts on Programming Environment and Tools in China's High-tech R&D Program
Depei Qian
Sino-German Joint Software Institute (JSI), Beihang University
Email: [email protected]
August 1, 2011, CScADS tools workshop
China's High-tech Program

The National High-tech R&D Program (863 Program)
- Proposed by four senior Chinese scientists and approved by former leader Mr. Deng Xiaoping in March 1986.
- One of the most important national science and technology R&D programs in China.
- Now a regular national R&D program planned in 5-year terms; the term just finished corresponds to the 11th five-year plan.
863 key projects on HPC and Grid

"High performance computer and core software"
- 4-year project, May 2002 to Dec. 2005.
- 100 million Yuan funding from the MOST.
- More than 2x associated funding from local governments, application organizations, and industry.
- Outcome: China National Grid (CNGrid).

"High productivity computer and Grid service environment"
- Period: 2006-2010.
- 940 million Yuan from the MOST and more than 1 billion Yuan in matching funds from other sources.
HPC development (2006-2010)

First phase: developing two 100 TFlops machines
- Dawning 5000A for SSC
- Lenovo DeepComp 7000 for the SC of CAS

Second phase: three 1000 TFlops machines
- Tianhe-1A: CPU+GPU, NUDT / Tianjin Supercomputing Center
- Dawning 6000: CPU+GPU, ICT / Dawning / South China Supercomputing Center (Shenzhen)
- Sunway: CPU-only, Jiangnan / Shandong Supercomputing Center
CNGrid development

11 sites:
- CNIC, CAS (Beijing, major site)
- Shanghai Supercomputer Center (Shanghai, major site)
- Tsinghua University (Beijing)
- Institute of Applied Physics and Computational Mathematics (Beijing)
- University of Science and Technology of China (Hefei, Anhui)
- Xi'an Jiaotong University (Xi'an, Shaanxi)
- Shenzhen Institute of Advanced Technology (Shenzhen, Guangdong)
- Hong Kong University (Hong Kong)
- Shandong University (Jinan, Shandong)
- Huazhong University of Science and Technology (Wuhan, Hubei)
- Gansu Provincial Computing Center

The CNGrid Operation Center (based on CNIC, CAS)
CNGrid GOS Architecture
[Layered architecture diagram.] From top to bottom:
- Tool/App layer: GSML browser and composer, GSML workshop, HPCG application & management portal, system management portal, Gsh & command-line tools, IDE, debugger, compiler, VegaSSH, and other domain-specific applications.
- Core, system and application level services: the GOS library (batch, message, file, etc.) and GOS system calls (resource management, Agora management, user management, grip management, etc.); the HPCG backend; Axis handlers for message-level security; and services such as CA, metainfo management, file management, batch-job management, account management, metascheduling, messaging, dynamic deployment, grip, DataGrid, and grid workflow (workflow engine and DB service).
- System layer: Tomcat (5.0.28) + Axis (1.2 RC2) on J2SE (1.4.2_07, 1.5.0_07); Agora (security, resource space, resource access control & sharing, user management, Agora management) and the Core (resource management, naming, grip runtime, grip instance management, ServiceController, other RControllers).
- Hosting environment: PC servers (grid servers) running Linux/Unix/Windows with Tomcat (Apache) + Axis, GT4, gLite, OMII, Java J2SE and other third-party software and tools; the grid portal, Gsh+CLI, GSML workshop and grid applications run on top.
JASMIN: a parallel programming framework
Contact: Prof. Zeyao Mo, IAPCM, Beijing
[email protected]

Basic ideas
[Diagram.] From special application codes, the common models, stencils and algorithms are separated out into a library; their data dependencies are extracted and used to form the infrastructure -- data structures, parallel computing models, communications and load balancing -- which supports the applications on the target computers. The result is parallel middleware for scientific computing.
Basic ideas
- Hides parallel programming for millions of cores and the hierarchy of parallel computers;
- Integrates efficient implementations of parallel fast numerical algorithms;
- Provides efficient data structures and solver libraries;
- Supports software engineering for code extensibility.
Basic Ideas
[Diagram.] JASMIN (J parallel Adaptive Structured Mesh INfrastructure; http://www.iapcm.ac.cn/jasmin, software copyright 2010SR050446, developed since 2003) sits between application codes and the hardware: codes written in a serial-programming style on a personal computer are carried by JASMIN onto TeraFlops clusters and PetaFlops MPPs. Application areas shown include inertial confinement fusion, global climate modeling, particle simulation, CFD and material simulations. Mesh types shown: structured grid and unstructured grid.
JASMIN (V. 2.0)
- User provides: physics, parameters, numerical methods, expert experience, special algorithms, etc.
- User interfaces: component-based parallel programming models (C++ classes).
- Numerical algorithms: geometry, fast solvers, mature numerical methods, time integrators, etc.
- HPC implementations (thousands of CPUs): data structures, parallelization, load balancing, adaptivity, visualization, restart, memory management, etc.
- Architecture: multilayered, modularized, object-oriented.
- Code: C++/C/F90/F77 + MPI/OpenMP, about 500,000 lines.
- Installation: personal computers, clusters, MPPs.
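To make the component-based C++ user interface concrete, here is a minimal sketch under loose assumptions; the class and method names (Patch, PatchComponent, computeOnPatch, ExplicitDiffusion) are hypothetical stand-ins, not the real JASMIN classes. It only illustrates the division of labor described above: the user supplies per-patch numerics, while the framework owns the structured-mesh data, communication and load balancing and calls back into the component on each local patch.

    // Minimal illustrative sketch -- hypothetical names, not the real JASMIN API.
    #include <vector>

    struct Patch {                       // one rectangular piece of the structured mesh
      int nx = 0, ny = 0;                // local cell counts
      std::vector<double> u;             // a cell-centered field managed by the framework
    };

    class PatchComponent {               // interface the framework would call
     public:
      virtual ~PatchComponent() = default;
      virtual void computeOnPatch(Patch& p, double dt) = 0;
    };

    // User-supplied physics/numerics for one patch; no MPI or mesh management here.
    class ExplicitDiffusion : public PatchComponent {
     public:
      void computeOnPatch(Patch& p, double dt) override {
        std::vector<double> unew(p.u);
        for (int j = 1; j < p.ny - 1; ++j)
          for (int i = 1; i < p.nx - 1; ++i) {
            const int k = j * p.nx + i;
            unew[k] = p.u[k] + dt * (p.u[k - 1] + p.u[k + 1] +
                                     p.u[k - p.nx] + p.u[k + p.nx] - 4.0 * p.u[k]);
          }
        p.u.swap(unew);
      }
    };

    int main() {                         // stand-in for the framework's patch loop
      Patch p;  p.nx = 16;  p.ny = 16;  p.u.assign(p.nx * p.ny, 1.0);
      ExplicitDiffusion comp;
      comp.computeOnPatch(p, 0.1);
      return 0;
    }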
Inertial Confinement Fusion (supported by JASMIN): 2004-now
ICF application codes: 13 codes, 46 researchers, developed concurrently, combining different numerical methods, physical parameters and expert experience within the simulation cycle.

JASMIN:
- hides the parallel computing and adaptive implementations that use tens of thousands of CPU cores;
- provides efficient data structures, algorithms and solvers;
- supports software engineering for code extensibility.
Numerical simulations on TianHe-1A

Code        # CPU cores        Code                 # CPU cores
LARED-S     32,768             RH2D                 1,024
LARED-P     72,000             HIME3D               3,600
LAP3D       16,384             PDD3D                4,096
MEPH3D      38,400             LARED-R              512
MD3D        80,000             LARED Integration    128
RT3D        1,000

Simulation duration: several hours to tens of hours.
Code evolution from 2004 to 2010:
- LARED-H, a 2-D radiation hydrodynamics Lagrangian code: serial, single block, without capsule (2004) -> parallel, multiblock, NIF ignition target (2010).
- LARED-R, a 2-D radiation transport code: serial (2004) -> parallel on 2,048 cores, MPI scaled up by a factor of 1000 (2010).
- LARED-S, a 3-D radiation hydrodynamics Eulerian code: parallel, single level, 2-D single group, no radiation in 3-D (2004) -> parallel on 32,768 cores, SAMR, multi-group diffusion in 2-D, multigroup radiation diffusion in 3-D (2010).
- LARED-P, a 3-D laser plasma interaction code: MPI (2004) -> parallel on 36,000 cores, terascale numbers of particles (2010).
GPU programming support and performance optimization
Contact: Prof. Xiaoshe Dong, Xi'an Jiaotong University
Email: [email protected]

GPU program optimization
Three approaches (levels) for GPU program optimization:
- memory-access level
- kernel-speedup level
- data-partition level
Source-to-source translation for GPU
- Developed GPU-S2S, a source-to-source translator for GPUs.
- It facilitates the development of parallel programs on GPUs by combining automatic mapping with static compilation.
- Directives inserted into the source program:
  - guide implicit calls to the CUDA runtime libraries;
  - let the user control the mapping of compute-intensive parts from the homogeneous CPU platform to the GPU's streaming platform.
- Optimization based on runtime profiling: dynamic information collected at run time is used to exploit the GPU fully according to the characteristics of the application.
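As an illustration of the directive-based workflow, a compute-intensive loop in the homogeneous C code might be annotated roughly as below. The #pragma gpus2s syntax is hypothetical (the talk does not show the concrete directive grammar); the point is only that the user marks the kernel and its data so that the translator can generate the CUDA runtime calls.

    #define N 4096

    void vec_add(const float *a, const float *b, float *c)
    {
        /* Hypothetical GPU-S2S directive -- illustrative only.  It marks the loop
         * as a compute-intensive kernel and names the data to copy in/out, so the
         * translator can emit the CUDA allocation, transfer and launch code. */
        #pragma gpus2s kernel copyin(a[0:N], b[0:N]) copyout(c[0:N])
        for (int i = 0; i < N; ++i)
            c[i] = a[i] + b[i];
    }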
The GPU-S2S architecture
[Architecture diagram.] At the software-productivity layer, GPU-S2S accepts programs written against the PGAS, MPI message-passing or Pthread thread models and translates them using profile information, a GPU supporting library, shared-library calls and the user standard library. At the performance layer, run-time performance collection feeds information back to the translator. Everything runs on top of the operating system and the GPU platform.
Program translation by GPU-S2S
[Diagram.] Before translation, the source code follows the homogeneous-platform program framework: a user-defined part with directives on the compute-intensive applications, calls into a template library of optimized compute-intensive functions for the homogeneous platform, profile information, and shared-library calls. After translation, the code follows the GPU streaming-architecture program framework: a CPU control program plus GPU kernel programs generated from the templates, a general-purpose computing interface, the template library of optimized compute-intensive applications, profile information, shared-library calls, and the user standard library.
Runtime optimization based on profiling
[Flowchart.] The homogeneous platform code (*.c, *.h) is pretreated and compiled with a C compiler, and profiling proceeds in three levels:
- First-level profiling (function level): first-level dynamic instrumentation is inserted, the code is compiled and run, and the extracted profile identifies the computing kernel; directives are then inserted automatically.
- Second-level profiling (memory access and kernel improvement): second-level dynamic instrumentation is inserted, the code is compiled and run, and the extracted profile gives the data block size, the shared-memory configuration parameters, and a judgement on whether streams can be used; CUDA code containing the optimized kernel is generated. If no further optimization is needed, the process terminates here.
- Third-level profiling (data partition): if further optimization is needed, third-level dynamic instrumentation is inserted into the CUDA code, which is compiled and run to extract the number of streams and the data size of each stream; CUDA code using streams is then generated.
The resulting CUDA code (*.h, *.cu, *.c) is compiled with the CUDA compiler tools into executable GPU code (*.o).
First level profiling
[Diagram: the source-to-source compiler brackets function0 ... functionN of the homogeneous platform code (address-space allocation, initialization, free) with instrumentation pairs.]
GPU-S2S scans the source code before translation, finds every function, inserts instrumentation before and after it, computes the execution time of each function, and finally identifies the computing kernel.
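In effect, the first-level pass wraps every candidate function with timers and keeps the most expensive one as the computing kernel. A minimal sketch of that kind of instrumentation (hypothetical helper names, not the code GPU-S2S actually emits):

    // Sketch of function-level instrumentation (hypothetical helpers).
    // A probe before and after each function accumulates its wall-clock time;
    // the most expensive entry is reported as the computing-kernel candidate.
    #include <chrono>
    #include <cstdio>
    #include <map>
    #include <string>

    static std::map<std::string, double> g_seconds;

    struct ScopedProbe {                 // what "instrumentation0/1/..." amounts to
      const char* name;
      std::chrono::steady_clock::time_point t0;
      explicit ScopedProbe(const char* n)
          : name(n), t0(std::chrono::steady_clock::now()) {}
      ~ScopedProbe() {
        g_seconds[name] += std::chrono::duration<double>(
                               std::chrono::steady_clock::now() - t0).count();
      }
    };

    void function1() { ScopedProbe p("function1"); /* original body ... */ }

    int main() {
      function1();
      for (const auto& kv : g_seconds)   // the hottest function is the kernel candidate
        std::printf("%s: %.6f s\n", kv.first.c_str(), kv.second);
      return 0;
    }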
Second level profiling
[Diagram: the source-to-source compiler wraps each computing kernel of the homogeneous platform code with instrumentation.]
GPU-S2S scans the code and inserts instrumentation at the corresponding places around the computing kernels. It extracts the profile information, analyzes the code, performs optimizations, expands the templates according to the features of the application, and finally generates CUDA code with an optimized kernel. Using shared memory is a general optimization at this level; it involves 13 configuration parameters, and different values give different performance.
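A generic example of the kind of memory-access optimization this level targets (not code produced by GPU-S2S) is the classic tiled matrix multiply that stages blocks of the operands in shared memory; the tile size is one of the tunable parameters such a profiling pass would choose.

    // Generic shared-memory tiling example (illustrative; CUDA C++).
    #define TILE 16

    __global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
      __shared__ float As[TILE][TILE];
      __shared__ float Bs[TILE][TILE];
      int row = blockIdx.y * TILE + threadIdx.y;
      int col = blockIdx.x * TILE + threadIdx.x;
      float acc = 0.0f;
      for (int t = 0; t < n; t += TILE) {
        // Stage one tile of A and one tile of B in shared memory.
        As[threadIdx.y][threadIdx.x] = (row < n && t + threadIdx.x < n)
                                           ? A[row * n + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (col < n && t + threadIdx.y < n)
                                           ? B[(t + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
          acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
      }
      if (row < n && col < n) C[row * n + col] = acc;
    }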
Third level profiling
[Diagram: the source-to-source compiler instruments the CUDA control code around the global address-space allocation, function0--copyin, function0--kernel and function0--copyout.]
GPU-S2S scans the code, finds each computing kernel and its copy functions, and inserts instrumentation at the corresponding places to measure the copy time and the computing time. From these times it computes the number of streams and the data size of each stream, and finally generates the optimized CUDA code with streams.
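The stream-based code the third pass aims at can be sketched as follows (illustrative, not actual GPU-S2S output): the input is split into chunks so that the copy-in, kernel execution and copy-out of different chunks overlap. For real overlap the host buffer would need to be pinned (allocated with cudaHostAlloc).

    #include <cuda_runtime.h>

    __global__ void scale_chunk(float* d, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d[i] *= 2.0f;
    }

    void process(float* h, float* d, int n, int nstream) {
      cudaStream_t* s = new cudaStream_t[nstream];
      const int chunk = (n + nstream - 1) / nstream;   // data size of every stream
      for (int k = 0; k < nstream; ++k) cudaStreamCreate(&s[k]);
      for (int k = 0; k < nstream; ++k) {
        const int off = k * chunk;
        if (off >= n) break;
        const int len = (off + chunk <= n) ? chunk : n - off;
        cudaMemcpyAsync(d + off, h + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);          // copy in
        scale_chunk<<<(len + 255) / 256, 256, 0, s[k]>>>(d + off, len);
        cudaMemcpyAsync(h + off, d + off, len * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);          // copy out
      }
      for (int k = 0; k < nstream; ++k) {
        cudaStreamSynchronize(s[k]);
        cudaStreamDestroy(s[k]);
      }
      delete[] s;
    }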
Verification and experiment

Experiment platform:
- Server: 4-core Xeon CPU with 12 GB memory, NVIDIA Tesla C1060
- Red Hat Enterprise Linux Server 5.3
- CUDA version 2.3

Test examples:
- Matrix multiplication
- Fast Fourier transform (FFT)
[Charts: matrix multiplication performance comparison before and after profiling -- run time (ms) for input sizes 1024-8192, comparing code using only global memory, memory-access optimization, second-level profile optimization and third-level profile optimization; and execution performance comparison on different platforms (GPU versions vs. CPU).]
The CUDA code with three-level profiling optimization achieves a 31% improvement over the CUDA code with only memory-access optimization, and a 91% improvement over the CUDA code using only global memory for computing.
[Charts: FFT (1,048,576 points) performance comparison before and after profiling -- run time (ms) for batch numbers 15-60, comparing the same four code versions; and execution performance comparison on different platforms (GPU versions vs. CPU).]
The CUDA code after three-level profiling optimization achieves a 38% improvement over the CUDA code with memory-access optimization, and a 77% improvement over the CUDA code using only global memory for computing.
Programming multi-GPU systems
- The traditional programming models, MPI and PGAS, are not directly suitable for the new CPU+GPU platforms.
- Legacy applications cannot exploit the power of GPUs.

Programming model for the CPU-GPU architecture:
- Combine a traditional programming model with a GPU-specific programming model to form a mixed programming model.
- Aim for better performance on the CPU-GPU architecture, making more efficient use of the computing power.
[Diagram: several CPUs, each attached to one or more GPUs.]
Programming multi-GPU systems
The memory of a CPU+GPU system is both distributed and shared, so it is feasible to use the MPI and PGAS programming models for this new kind of system.
[Diagram: CPUs with their main memories exchange message data between private spaces (MPI) or share data in a shared space (PGAS); each CPU drives GPUs with their own device memories.]
Message passing or shared data is used for communication between parallel tasks and their GPUs.
Mixed Programming Model
NVIDIA GPU: CUDA. Traditional programming model: MPI/UPC. Combined: MPI+CUDA or UPC+CUDA.
[Flow diagram.] The program starts on the host CPU with device choosing and program initialization under the MPI/UPC runtime. Each parallel task then copies its source data from main memory into device memory (cudaMemcpy), calls its computing kernel on the GPU (CUDA program execution), communicates with other tasks through the communication interface of the upper programming model, copies the result data back from device memory, and ends.
Mixed Programming Model
- The primary control of an application is implemented in the MPI or UPC programming model; the computing kernels are implemented in CUDA, using the GPU to accelerate the computation.
- The computing kernel is optimized to make better use of the GPUs.
- GPU-S2S can be used to generate the computing-kernel program, hiding the CPU+GPU heterogeneity from the user and improving the portability of the application.
Compiling process: the primary control program, which includes the declaration of the computing kernel, is compiled with mpicc/upcc; the computing-kernel program is compiled with nvcc; the objects are linked with nvcc; and the result is run with mpirun/upcrun.
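A minimal MPI+CUDA program with this structure might look as follows (an illustrative sketch, not the project's code): MPI provides the primary control and inter-task communication, and a CUDA kernel implements the computing kernel. Following the compiling flow above, the host part would be compiled with mpicc, the kernel file with nvcc, the objects linked together, and the result launched with mpirun.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    __global__ void scale(float* d, int n, float f) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d[i] *= f;
    }

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank = 0, nprocs = 1;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      int ndev = 0;
      cudaGetDeviceCount(&ndev);
      if (ndev > 0) cudaSetDevice(rank % ndev);    // device choosing

      const int n = 1 << 20;                       // this task's share of the data
      std::vector<float> h(n, 1.0f + rank);
      float* d = nullptr;
      cudaMalloc(&d, n * sizeof(float));
      cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);  // copy in
      scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);                         // kernel
      cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy out

      double local = h[0], sum = 0.0;              // communication between tasks
      MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0) std::printf("sum over %d tasks: %f\n", nprocs, sum);

      cudaFree(d);
      MPI_Finalize();
      return 0;
    }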
MPI+CUDA experiment

Platform:
- 2 NF5588 servers, each equipped with
  - 1 Xeon CPU (2.27 GHz), 12 GB main memory
  - 2 NVIDIA Tesla C1060 GPUs (GT200 architecture, 4 GB device memory)
- 1 Gbit Ethernet
- Red Hat Linux 5.3
- CUDA Toolkit 2.3 and CUDA SDK
- OpenMPI 1.3
- Berkeley UPC 2.1
MPI+CUDA experiment (cont'd)

Matrix multiplication program:
- Uses block matrix multiplication for the UPC programming; data are spread over the UPC threads.
- The computing kernel multiplies two blocks at a time and is implemented in CUDA.
- Total execution time: Tsum = Tcom + Tcuda = Tcom + Tcopy + Tkernel, where
  - Tcom: UPC thread communication time
  - Tcuda: CUDA program execution time
  - Tcopy: data transfer time between host and device
  - Tkernel: GPU computing time
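One common way to separate Tcopy from Tkernel on the device side is CUDA event timing; a small self-contained sketch (illustrative, not the experiment's actual instrumentation):

    // Illustrative separation of Tcopy and Tkernel with CUDA events;
    // Tcuda = Tcopy + Tkernel.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void dummy_kernel(float* d, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d[i] += 1.0f;
    }

    int main() {
      const int n = 1 << 20;
      const size_t bytes = n * sizeof(float);
      float *h = new float[n](), *d = nullptr;
      cudaMalloc(&d, bytes);

      cudaEvent_t e0, e1, e2, e3;
      cudaEventCreate(&e0); cudaEventCreate(&e1);
      cudaEventCreate(&e2); cudaEventCreate(&e3);

      cudaEventRecord(e0);
      cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);      // copy in
      cudaEventRecord(e1);
      dummy_kernel<<<(n + 255) / 256, 256>>>(d, n);         // computing kernel
      cudaEventRecord(e2);
      cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);      // copy out
      cudaEventRecord(e3);
      cudaEventSynchronize(e3);

      float t_in = 0, t_k = 0, t_out = 0;                   // milliseconds
      cudaEventElapsedTime(&t_in, e0, e1);
      cudaEventElapsedTime(&t_k, e1, e2);
      cudaEventElapsedTime(&t_out, e2, e3);
      std::printf("Tcopy = %.3f ms, Tkernel = %.3f ms\n", t_in + t_out, t_k);

      cudaFree(d); delete[] h;
      return 0;
    }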
MPI+CUDA experiment (cont'd)
Setup: 2 servers, up to 8 MPI tasks; 1 server with 2 GPUs.
- For 4096x4096 matrices, the speedup of 1 MPI+CUDA task (using 1 GPU for computing) is 184x over the case with 8 MPI tasks.
- For small data sizes such as 256 and 512, the execution time with 2 GPUs is even longer than with 1 GPU: the computing scale is too small, and the communication between the two tasks overwhelms the reduction in computing time.
MPI+CUDA experiment (cont'd)
- Matrix sizes 8192x8192 and 16384x16384: Tcuda decreases as the number of tasks increases, but the Tsum of 4 tasks is larger than that of 2.
- Reason: the latency of the Ethernet between the 2 servers is much higher than the latency of the bus inside one server.
- If the computing scale is larger, or a faster network (e.g. InfiniBand) is used, multiple nodes with multiple GPUs will still improve application performance.
Programming Support and Compilers
Contact: Prof. Xiaobing Feng, ICT, CAS, Beijing
[email protected]

Advanced Compiler Technology (ACT) Group at the ICT, CAS
- The Institute of Computing Technology (ICT), founded in 1956, is the first and a leading institute on computing technology in China.
- ACT was founded in the early 1960s and has over 40 years of experience with compilers:
  - compilers for most of the mainframes developed in China;
  - compiler and binary translation tools for Loongson processors;
  - parallel compilers and tools for the Dawning series (SMP/MPP/cluster).
Advanced Compiler Technology (ACT) Group at the ICT, CAS
ACT's current research:
- parallel programming languages and models;
- optimizing compilers and tools for HPC (Dawning) and multi-core processors (Loongson).
Advanced Compiler Technology (ACT) Group at the ICT, CAS
PTA model (Process-based TAsk parallel programming model):
- A new process-based task construct with the properties of isolation, atomicity and deterministic submission.
- A loop is annotated into two parts, a prologue and a task segment, using the directives below (see the sketch after this list):
    #pragma pta parallel [clauses]
    #pragma pta task
    #pragma pta propagate (varlist)
- Suitable for expressing coarse-grained, irregular parallelism on loops.

Implementation and performance:
- PTA compiler, runtime system and an assistant tool that helps write correct programs.
- Speedup: 4.62 to 43.98 (average 27.58) on 48 cores; 3.08 to 7.83 (average 6.72) on 8 cores.
- Code changes stay within 10 lines, much smaller than with OpenMP.
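A hedged sketch of how a loop might be annotated with these directives, based only on the pragmas listed above (the exact clause syntax and semantics belong to the ACT group's implementation and may differ):

    // Illustrative use of the PTA directives; compiles as plain C++ (unknown
    // pragmas are ignored by standard compilers, so the loop then runs serially).
    #include <cstdio>

    int main() {
      const int n = 64;
      static double result[64];

      #pragma pta parallel                  // parallel region over the loop
      for (int i = 0; i < n; ++i) {
        int seed = i * 17;                  // prologue: cheap setup, runs in order

        #pragma pta task                    // task segment: isolated, atomic
        {
          double acc = 0.0;
          for (int k = 0; k < 100000; ++k) acc += (seed % 7) * 1e-6;
          result[i] = acc;
          #pragma pta propagate(result)     // deterministic submission of results
        }
      }
      std::printf("%f\n", result[0]);
      return 0;
    }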
UPC-H: A Parallel Programming Model for Deep Parallel Hierarchies
Hierarchical UPC:
- multi-level data distribution;
- implicit and explicit hierarchical loop parallelism;
- hybrid execution model: SPMD with fork-join;
- multi-dimensional data distribution and super-pipelining.

Implementations on CUDA clusters and the Dawning 6000 cluster:
- based on Berkeley UPC;
- enhanced optimizations such as localization and communication optimization;
- SIMD intrinsics supported;
- CUDA cluster: 72% of the hand-tuned version's performance, with the code reduced to 68%;
- multi-core cluster: better process mapping and cache reuse than UPC.
OpenMP and Runtime Support for Heterogeneous Platforms
Heterogeneous platforms consisting of CPUs and GPUs.

OpenMP extension:
- specify the partitioning ratio to optimize data transfer globally (illustrated in the sketch after this slide);
- specify heterogeneous blocking sizes to reduce false sharing among computing devices.

Runtime support:
- Multiple GPUs, or CPU-GPU cooperation, bring extra data transfers that hurt the performance gain, so programmers need a unified data management system.
- A DSM system based on the specified blocking size.
- Intelligent runtime prefetching with the help of compiler analysis.

Implementation and results:
- implemented in the OpenUH compiler;
- gains a 1.6x speedup through prefetching on NPB/SP (class C).
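The partitioning-ratio idea can be illustrated without the extended directive syntax (which the talk does not spell out): in the sketch below a fraction "ratio" of the elements is sent to the GPU and the rest is processed by CPU threads with plain OpenMP. The real extension expresses this split, and the blocking sizes, as clauses on OpenMP directives.

    #include <cuda_runtime.h>
    #include <omp.h>
    #include <vector>

    __global__ void saxpy_gpu(float a, const float* x, float* y, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] += a * x[i];
    }

    void saxpy_split(float a, const std::vector<float>& x, std::vector<float>& y,
                     double ratio) {             // ratio: share of work given to the GPU
      const int n = static_cast<int>(x.size());
      const int ngpu = static_cast<int>(ratio * n);

      float *dx = nullptr, *dy = nullptr;
      if (ngpu > 0) {                            // first ngpu elements on the GPU
        cudaMalloc(&dx, ngpu * sizeof(float));
        cudaMalloc(&dy, ngpu * sizeof(float));
        cudaMemcpy(dx, x.data(), ngpu * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y.data(), ngpu * sizeof(float), cudaMemcpyHostToDevice);
        saxpy_gpu<<<(ngpu + 255) / 256, 256>>>(a, dx, dy, ngpu);
      }

      #pragma omp parallel for                   // remaining elements on CPU threads
      for (int i = ngpu; i < n; ++i)
        y[i] += a * x[i];

      if (ngpu > 0) {
        cudaMemcpy(y.data(), dy, ngpu * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dx);
        cudaFree(dy);
      }
    }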
Analyzers Based on Compiling Techniques for MPI Programs
Communication slicing and process mapping tool:
- compiler part: PDG graph building and slice generation; iteration-set transformation for approximation;
- optimized mapping tool: weighted graphs and hardware characteristics; graph partitioning and feedback-based evaluation.

Memory bandwidth measuring tool for MPI programs:
- detects bursts of bandwidth requirements.

Enhanced performance of MPI error checking:
- redundant error checks are removed by dynamically turning the global error checking on and off, with the help of compiler analysis of communicators;
- integrated with a model checking tool (ISP) and a runtime checking tool (MARMOT).
LoongCC: An Optimizing Compiler for Loongson Multicore Processors
- Based on Open64-4.2, supporting C/C++/Fortran.
- Open source at http://svn.open64.net/svnroot/open64/trunk/
- Powerful optimizer and analyzer with better performance:
  - SIMD intrinsic support;
  - memory locality optimization;
  - data layout optimization;
  - data prefetching;
  - load/store grouping for 128-bit memory access instructions.
- Integrated with an Aggressive Auto-Parallelization Optimization (AAPO) module:
  - dynamic privatization;
  - parallel model with dynamic alias optimization;
  - array reduction optimization.
DigitalBridge: A Binary Translation System for Loongson Multicore Processors
Fully utilizes the hardware characteristics of Loongson CPUs:
- handles return instructions with a shadow stack;
- handles EFLAGS operations with flag patterns;
- emulates the x86 FPU with local FP registers;
- combines static and dynamic translation;
- handles indirect-jump tables;
- handles misaligned data accesses through dynamic profiling and an exception handler;
- improves data locality by pool allocation;
- promotes stack variables.
Software Tools for High Performance Computing
Contact: Prof. Yi Liu, JSI, Beihang University
[email protected]

LSP3AS: large-scale parallel program performance analysis system
- Designed for performance tuning on peta-scale HPC systems.
- The overall method is conventional:
  - the source code is instrumented by inserting specified function calls;
  - the instrumented code is executed while performance data are collected, generating profiling and tracing data files;
  - the profiling/tracing data are analyzed and a visualization report is generated.
- Instrumentation is based on TAU from the University of Oregon.
[Workflow diagram: source code -> TAU instrumentation and measurement API -> instrumented code -> compiler/linker with external libraries and the RDMA library -> executable and data files -> profiling and tracing tools -> performance data files -> visualization and analysis. Innovations over the traditional process: dynamic compensation, RDMA transmission and buffer management, clustering analysis based on iteration, and clustering visualization based on hierarchical classification.]
LSP3AS: large-scale parallel program performance analysis system
With roughly ten thousand nodes in a peta-scale system, massive performance data will be generated, transmitted and stored, so the collection structure must be scalable:
- Distributed data collection and transmission eliminate bottlenecks in the network and in data processing.
- A dynamic compensation algorithm reduces the influence of the performance-data volume.
- Efficient data transmission uses Remote Direct Memory Access (RDMA) to achieve high bandwidth and low latency.
[Diagram: user processes on compute nodes write performance data into shared memory; sender threads transmit the data over RDMA to receiver threads on I/O nodes, which store them through Lustre or GFS clients into the FC-attached storage system.]
LSP3AS: large-scale parallel program performance analysis system
Analysis & visualization -- two approaches to deal with the huge amount of data:
- data analysis: an iteration-based clustering approach from data-mining technology;
- visualization: clustering visualization based on hierarchical classification.
SimHPC: Parallel Simulator
Challenge for HPC simulation: performance.
- Target systems have more than 1,000 nodes and processors, which is difficult for traditional architecture simulators (e.g. Simics).

Our solution: parallel simulation -- using a cluster to simulate a cluster.
- Use the same kind of node in the host system as in the target.
  - Basis: HPC systems use commercial processors and even blades, which are also available to the simulator, so the execution time of an instruction sequence is the same on host and target (processes make things a little more complicated; this is discussed later).
  - Advantage: no need to model and simulate detailed components such as processor pipelines and caches.
- Execution-driven, full-system simulation; supports execution of Linux and applications, including benchmarks (e.g. Linpack).
SimHPC: Parallel Simulator (cont'd)
Analysis: the execution time of a process in the target system is composed of

    Tprocess = Trun + TIO + Tready

- Trun: execution time of the instruction sequences -- equal to the host, obtained from the Linux kernel;
- TIO: I/O blocking time (e.g. reading/writing files, sending/receiving messages) -- unequal to the host, needs to be simulated;
- Tready: waiting time in the ready state -- unequal to the host, needs to be recalculated.

So the simulator needs to:
1. capture system events: process scheduling, and I/O operations such as file reads/writes and MPI send()/recv();
2. simulate the I/O and interconnection-network subsystems;
3. synchronize the timing of each application process.
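In code form, the time-axis synchronization amounts to recomputing each process's time from the captured events; a simplified sketch with hypothetical structures (not the simulator's actual code):

    // Trun is reused from the host, TIO is replaced by the simulated I/O /
    // network time, and Tready is recomputed by the central scheduler.
    struct ProcessEvents {
      double t_run_host;       // Trun: instruction-sequence time, same as on the host
      double t_io_simulated;   // TIO: from the I/O and interconnect models
      double t_ready_target;   // Tready: recomputed from the target's scheduling
    };

    inline double target_process_time(const ProcessEvents& e) {
      return e.t_run_host + e.t_io_simulated + e.t_ready_target;   // Tprocess
    }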
SimHPC: Parallel Simulator (cont'd)
System architecture:
- The application processes of multiple target nodes are allocated to one host node (the number of host nodes is much smaller than the number of target nodes).
- Events are captured on the host nodes while the application is running.
- Events are sent to a central node for analysis, time-axis synchronization, and simulation.
[Diagram: parallel application processes belonging to different target nodes run on host nodes (host Linux on the host hardware platform); an event-capture module on each host feeds the event collection and control component, which performs analysis and time-axis synchronization and drives the architecture simulators for the interconnection network and disk I/O, producing the simulation results.]
SimHPC: Parallel Simulator (cont'd)
Experiment results:
- Host: 5 IBM HS21 blades (2-way Xeon)
- Target: 32 - 1024 nodes
- OS: Linux
- Application: Linpack HPL
[Charts: simulation slowdown; simulation error test; Linpack performance and communication time for fat-tree and 2D-mesh interconnects.]
System-level Power Management
Power-aware job scheduling algorithm (integrated into OpenPBS); the idea, sketched in code below:
1. Suspend a node if its idle time exceeds a threshold.
2. Wake nodes up if there are not enough nodes to execute the jobs.
3. Avoid nodes thrashing between the busy and suspended states: since the suspend and wakeup operations themselves consume power, do not wake up a suspended node that has only just gone to sleep.
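A compact sketch of this suspend/wakeup policy (illustrative C++, not the actual OpenPBS integration; idle_threshold and min_sleep are hypothetical tuning knobs):

    #include <vector>

    struct Node {
      bool   suspended = false;
      double idle_time = 0.0;      // seconds since the node last ran a job
      double asleep_time = 0.0;    // seconds since the node was suspended
    };

    void power_policy(std::vector<Node>& nodes, int nodes_needed,
                      double idle_threshold, double min_sleep) {
      // 1. Suspend nodes whose idle time exceeds the threshold.
      for (auto& n : nodes)
        if (!n.suspended && n.idle_time > idle_threshold) {
          n.suspended = true;
          n.asleep_time = 0.0;
        }

      // 2. Wake nodes only when the queued jobs cannot be served otherwise.
      //    Anti-thrashing: a node that has just gone to sleep is left alone.
      int awake = 0;
      for (const auto& n : nodes)
        if (!n.suspended) ++awake;
      for (auto& n : nodes) {
        if (awake >= nodes_needed) break;
        if (n.suspended && n.asleep_time > min_sleep) {
          n.suspended = false;     // in practice: trigger the wakeup (e.g. IPMI/WoL)
          ++awake;
        }
      }
    }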
System-level Power Management
Power management tool:
- monitors the power-related status of the system;
- reduces the runtime power consumption of the machine;
- supports multiple power management policies: manual control, on-demand control, suspend-enable, and others.
[Diagram: layers of power management -- a policy level on top of the power management software and interfaces, a management/interface level with a power management agent in each node, and a node level providing node sleep/wakeup, node on/off, CPU frequency control, fan speed control and power control of I/O equipment, with commands, status and power data flowing between the layers for control and monitoring.]
Power management test on 5 IBM HS21 blades, compared to no power management (power measured with a power measurement system):

Task load (tasks/hour)   Policy      Task exec. time (s)   Power consumption (J)   Performance slowdown   Power saving
20                       On-demand   3.55                  1,778,077               5.15%                  -1.66%
20                       Suspend     3.60                  1,632,521               9.76%                  -12.74%
200                      On-demand   3.55                  1,831,432               4.62%                  -3.84%
200                      Suspend     3.65                  1,683,161               10.61%                 -10.78%
800                      On-demand   3.55                  2,132,947               3.55%                  -7.05%
800                      Suspend     3.66                  2,123,577               11.25%                 -9.34%
Parallel Programming Platform for Astrophysics
Contact: Yunquan Zhang, ISCAS, Beijing
[email protected]

Parallel Computing Software Platform for Astrophysics
- Joint work of the Shanghai Astronomical Observatory, CAS (SHAO), the Institute of Software, CAS (ISCAS), and the Shanghai Supercomputer Center (SSC).
- Goal: build a high-performance parallel computing software platform for astrophysics research, focusing on planetary fluid dynamics and N-body problems.
- New parallel computing models and parallel algorithms are studied, validated and adopted to achieve high performance.
Software Architecture
[Diagram: a web portal on CNGrid sits on top of the software platform for astrophysics, which provides data processing and scientific visualization; physical and mathematical models and numerical methods feed the fluid dynamics and N-body solvers, built on PETSc, Aztec, FFTW, SpMV, GSL, an improved preconditioner and an improved library for collective communication, over MPI/OpenMP with Fortran and C, running on a 100T supercomputer with Lustre; parallel computing models and software development support cut across the layers.]
PETSc Optimized Version 1 (Speedup 4-6)
The PETSc optimized version 1 for astrophysics numerical simulation has been finished. An early performance evaluation of the Aztec code and the PETSc code on Dawning 5000A is shown. For the 80×80×50 mesh, the execution time of the Aztec program is 4-7 times that of the PETSc version, 6 times on average; for the 160×160×100 mesh, the execution time of the Aztec program is 2-5 times that of the PETSc version, 4 times on average.
[Charts: runtime (s) of Aztec vs. PETSc on Dawning 5000A for the 80×80×50 mesh (16-2048 processor cores) and the 160×160×100 mesh (32-2048 processor cores).]
PETSc Optimized Version 2 (Speedup 15-26)
- Method 1: domain decomposition ordering method for field coupling
- Method 2: preconditioner for the domain decomposition method
- Method 3: PETSc multi-physics data structure
Meshes: 128×128×96 (left) and 192×192×128 (right). Computation speedup: 15-26. Strong scalability: the original code scales normally, the new code nearly ideally. Test environment: BlueGene/L at NCAR (HPCA 2009).
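For readers less familiar with the PETSc side, selecting a domain-decomposition preconditioner for the linear solves looks roughly like the generic fragment below (standard PETSc API with a toy diagonal operator; the project's actual ordering, preconditioning and multi-physics data structures described above are more elaborate than this).

    /* Generic PETSc sketch, not the project's solver: solve A x = b with GMRES
     * preconditioned by additive Schwarz, a basic domain-decomposition method. */
    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      Mat A; Vec x, b; KSP ksp; PC pc;
      PetscInt n = 100, Istart, Iend, i;

      PetscInitialize(&argc, &argv, NULL, NULL);

      /* Toy operator: a diagonal matrix stands in for the assembled coupled system. */
      MatCreate(PETSC_COMM_WORLD, &A);
      MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
      MatSetFromOptions(A);
      MatSetUp(A);
      MatGetOwnershipRange(A, &Istart, &Iend);
      for (i = Istart; i < Iend; i++) MatSetValue(A, i, i, 2.0, INSERT_VALUES);
      MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
      MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

      MatCreateVecs(A, &x, &b);
      VecSet(b, 1.0);

      KSPCreate(PETSC_COMM_WORLD, &ksp);
      KSPSetOperators(ksp, A, A);
      KSPSetType(ksp, KSPGMRES);
      KSPGetPC(ksp, &pc);
      PCSetType(pc, PCASM);            /* additive Schwarz (domain decomposition) */
      KSPSetFromOptions(ksp);          /* allow -ksp_type / -pc_type overrides */
      KSPSolve(ksp, b, x);

      KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
      PetscFinalize();
      return 0;
    }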
Strong Scalability on Dawning 5000A
[Chart: runtime (s) of Aztec vs. PETSc on Dawning 5000A, 160×160×100 mesh, 32 to 8192 processor cores.]

Strong Scalability
[Chart: rotmp linear case, 192×192×128 mesh -- time (s) versus number of processor cores (64 to 8192) on BlueGene/L, Dawning 5000A and DeepComp 7000.]

Strong Scalability on TianHe-1A
[Chart: strong scalability results on TianHe-1A.]
CLeXML Math Library
[Diagram: the library combines task parallelism, self-adaptive tuning and multi-core parallelism, providing an iterative solver, LAPACK, BLAS, FFT and a computational model on top of the CPU, using techniques such as self-adaptive tuning, instruction reordering and software pipelining.]
[Charts: BLAS2, BLAS3 and FFT performance, MKL vs. CLeXML.]
HPC Software Support for Earth System Modeling
Contact: Prof. Guangwen Yang, Tsinghua University
[email protected]

Earth System Model Development Workflow
[Diagram: a development wizard and editor produce source code; the compiler, debugger and optimizer turn it into an executable implementing the (parallel) algorithm; together with the initial field, boundary conditions and other data from a standard data set managed by the data management subsystem, it runs in the running environment to perform the earth system model computation; the output feeds result evaluation and result visualization through the data visualization and analysis tools.]
Demonstrative Applications and Expected Results
[Diagram: an integrated high-performance computing environment for earth system models combines newly developed tools (data conversion, diagnosis, debugging, performance analysis, high availability) with existing tools (compiler, system monitor, version control, editor), software standards, international resources, and template and module libraries, serving research on global change and model application systems on the high performance computers in China.]
Integration and Management of Massive Heterogeneous Data
- Web-based data access portal:
  - provides simplified APIs for locating model data paths;
  - provides reliable metadata management and supports user-defined metadata;
  - supports the DAP data access protocol and provides model data queries.
- Data processing service based on 'cloud' methods:
  - provides an SQL-like query interface for climate model semantics;
  - supports parallel data aggregation and extraction;
  - supports online and offline conversion between different data formats;
  - supports graphical workflow operations.
- Data storage service on a parallel file system:
  - provides fast and reliable parallel I/O for climate modeling;
  - supports compressed storage for earth-science data.
Technical Route
[Diagram: a presentation layer (shell command line, Eclipse client, web browser, C & Fortran APIs, REST & SOAP web services) over a request-parsing engine; a support layer with the data access service (browse, transfer, query, publish, share), the data processing service (visualization, aggregation, extraction, conversion) and the data storage service (read, write, archive), backed by a toolset of Hadoop, MPI, OpenDAP, the GPU CUDA SDK, data grid middleware, HDF5, pNetCDF and PIO; and a storage layer with a key-value storage system, a memory file system, a compressed archive file system and the PVFS2 parallel file system.]
Fast Visualization and Diagnosis of Earth System Model Data
Research topics:
- design and implementation of parallel visualization algorithms:
  - parallel volume rendering algorithms that scale to hundreds of cores, with efficient data sampling and composition;
  - parallel contour-surface algorithms for quick extraction and composition of contour surfaces;
- performance optimization for TB-scale data field visualization;
- software acceleration for graphics and imaging;
- hardware acceleration for graphics and imaging;
- visual representation methods for earth system models.
[Diagram: computing nodes of the HPC system run the parallel visualization engine library; a data processor turns raw netCDF/NC data (TB scale) into preprocessed data described by a metadata manager; the PVE launcher, DMX and Chromium drive graphical nodes and a graphical workstation over a high-speed internal bus, producing an OpenGL stream for the high-resolution renderer and display wall and a pixel stream (Gbps) for local and remote (web) users.]
MPMD Program Debugging and Analysis
- MPMD parallel program debugging;
- MPMD parallel program performance measurement and analysis;
- support for efficient execution of MPMD parallel programs;
- fault-tolerance technologies for MPMD parallel programs.
[Diagram: the MPMD program debugging and analysis environment provides runtime support, high availability, debugging and performance analysis on top of the basic hardware/software environment.]
Technical Route
[Diagram: a presentation layer with an IDE integration framework (shell command line, Eclipse client, browser) offering a job management UI, a debug plug-in and a performance analysis plug-in; an abstraction layer providing the abstraction service; a service layer covering parallel debugging (query, instrumentation, grouping, tracking, plug-in and command control), performance analysis (data collection, analysis, data representation) and job/resource management (job control, job scheduling, resource management, system monitoring, reliability); and a fundamental support layer of management middleware over the operating system, file system, language environment, libraries and the hardware (nodes and network).]
Technical Route
[Diagram: the debugging and optimization IDE for earth system model programs, with a debugging window and a performance analysis window, sits on the Earth System Model abstraction service platform, which provides debugging services (debugging monitoring, event collection, debugging replay), performance optimization services (performance sampling data, program event collection), resource management and job scheduling (hierarchical scheduling, reliable monitoring), and system failure notification with fault-tolerant scheduling, on top of the earth system model MPMD program and the system execution environment.]
Integrated Development Environment (IDE)
- a plug-in-based expandable development platform;
- a template-based development supporting environment;
- a tool library for earth system model development;
- typical earth system model applications developed using the integrated development environment.

Plug-in integration method
[Diagram: the Eclipse platform -- the workbench (JFace, SWT), workspace, help, team and debug components on the platform runtime -- hosts the Java Development Tools (JDT), the Plug-in Development Environment (PDE), and third-party tools ("Your Tool", "Another Tool", "Their Tool") as plug-ins.]
Encapsulation of reusable modules
[Diagram: module units such as the radiation module, time integration module, solver module, boundary layer module and coupler module are wrapped according to a module encapsulation specification into a high-performance, reusable model module library.]
Thank You!