High Performance Computing Lecture 1


Parallel Scientific Computing:
Algorithms and Tools
Lecture #1
APMA 2821A, Spring 2008
Instructors: George Em Karniadakis
Leopold Grinberg
1
Logistics
Contact:
Office hours: GK: M 2-4 pm; LG: W 2-4 pm
Email: {gk,lgrinb}@dam.brown.edu
Web: www.cfm.brown.edu/people/gk/APMA2821A
 Textbook:
 Karniadakis & Kirby, “Parallel scientific computing in C++/MPI”
 Other books:
 Shonkwiler & Lefton, “Parallel and Vector Scientific Computing”
 Wadleigh & Crawford, “Software Optimization for High Performance
Computing”
 Foster, “Designing and Building Parallel Programs” (available online)
2
Logistics
CCV Accounts
Email: [email protected]
Prerequisite: C/Fortran programming
Grading:
5 assignments/mini-projects: 50%
1 Final project/presentation : 50%
3
History
4
History
5
Course Objectives
 Understanding of fundamental concepts
and programming principles for
development of high performance
applications
Be able to program a range of parallel
computers: PCs -> clusters -> supercomputers
 Make efficient use of high performance
parallel computing in your own research
6
Course Objectives
7
Content Overview
Parallel computer architecture: 2-3 weeks
CPU, Memory; Shared-/distributed-memory parallel
machines; network connections;
Parallel programming: 5 weeks
MPI; OpenMP; UPC
Parallel numerical algorithms: 4 weeks
Matrix algorithms; direct/iterative solvers;
eigensolvers; Monte Carlo methods (simulated
annealing, genetic algorithms)
Grid computing: 1 week
Globus, MPICH-G2
8
What & Why
 What is high performance computing (HPC)?
 The use of the most efficient algorithms on computers capable of
the highest performance to solve the most demanding problems.
 Why HPC?
 Large problems – spatially/temporally
• 10,000 x 10,000 x 10,000 grid -> 10^12 grid points -> 4x10^12
double variables -> 32x10^12 bytes = 32 Tera-Bytes (a quick arithmetic sketch follows this list).
• Usually need to simulate tens of millions of time steps.
• On-demand/urgent computing; real-time computing;
 Weather forecasting; protein folding; turbulence
simulations/CFD; aerospace structures; Full-body simulation/
Digital human …
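As a quick check of the memory estimate above, here is a minimal C sketch (assuming 4 double variables per grid point and 8 bytes per double, as on the slide):

/* Sketch: memory needed for a 10,000 x 10,000 x 10,000 grid with 4 doubles per point. */
#include <stdio.h>

int main(void) {
    double points  = 1e4 * 1e4 * 1e4;   /* 10^12 grid points             */
    double doubles = 4.0 * points;      /* 4 double variables per point  */
    double bytes   = 8.0 * doubles;     /* 8 bytes per double            */
    printf("%.1e doubles, %.1e bytes = %.0f Tera-Bytes\n",
           doubles, bytes, bytes / 1e12);   /* prints 32 Tera-Bytes      */
    return 0;
}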
9
HPC Examples: Blood Flow in
Human Vascular Network
 Cardiovascular disease accounts for
about 50% of deaths in western world;
 Formation of arterial disease strongly
correlated to blood flow patterns;
In one minute, the heart pumps the
entire blood supply of 5 quarts
through 60,000 miles of vessels, that
is a quarter of the distance between
the moon and the earth
Blood flow involves multiple scales
Computational challenges:
Enormous problem size
10
HPC Examples
Earthquake simulation
Surface velocity 75 sec after
earthquake
Flu pandemic simulation
300 million people tracked
Density of infected population,
45 days after breakout
11
HPC Example: Homogeneous Turbulence
Zoom-in
Zoom-in
Vorticity isosurface
Direct Numerical Simulation of Homogeneous Turbulence: 4096^3
12
How HPC fits into Scientific Computing
Diagram: physical processes (e.g. air flow around an airplane) are described by mathematical models (the Navier-Stokes equations), turned into numerical solutions (algorithms, BCs, solvers), run as application codes on supercomputers, and examined through data visualization, validation, and physical insight (viz software). HPC underpins this pipeline.
13
Performance Metrics
FLOPS, or FLOP/S: FLoating-point Operations Per
Second
MFLOPS: MegaFLOPS, 10^6 flops
GFLOPS: GigaFLOPS, 10^9 flops, home PC
TFLOPS: TeraFLOPS, 10^12 flops, present-day
supercomputers (www.top500.org)
PFLOPS: PetaFLOPS, 10^15 flops, by 2011
EFLOPS: ExaFLOPS, 10^18 flops, by 2020
MIPS = Mega Instructions per Second = MegaHertz (if one instruction per cycle)
Note: von Neumann computer -- 0.00083 MIPS
14
Performance Metrics
Theoretical peak performance R_theor:
maximum FLOPS a machine can reach in
theory.
R_theor = Clock_rate * no_cpus * no_FPU/CPU
3GHz, 2 CPUs, 1 FPU/CPU -> R_theor = 3x10^9 * 2 =
6 GFLOPS (a small sketch of this calculation appears at the end of this slide)
Real performance R_real: FLOPS for specific
operations, e.g. vector multiplication
Sustained performance R_sustained:
performance on an application, e.g. CFD
R_sustained << R_real << R_theor
Not uncommon: R_sustained < 10% of R_theor
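The relation between these rates can be sketched in a few lines of C; the 3 GHz, 2-CPU peak is the example above, while the sustained rate used here is a made-up number purely to illustrate the "< 10% of peak" point:

/* Sketch: theoretical peak vs. a hypothetical sustained application rate. */
#include <stdio.h>

int main(void) {
    double clock_hz    = 3.0e9;   /* 3 GHz                          */
    int    n_cpus      = 2;
    int    fpu_per_cpu = 1;
    double r_theor = clock_hz * n_cpus * fpu_per_cpu;   /* 6 GFLOPS */

    double r_sustained = 0.45e9;  /* made-up application rate       */
    printf("R_theor = %.1f GFLOPS\n", r_theor / 1e9);
    printf("R_sustained = %.0f%% of R_theor\n",
           100.0 * r_sustained / r_theor);
    return 0;
}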
15
Top 10 Supercomputers
www.top500.org
November 2007, LINPACK performance (Rmax and Rpeak in GFLOPS; Rmax corresponds to R_real, Rpeak to R_theor)

Rank | Site | Manufacturer | Computer | Country | Year | Processors | Rmax | Rpeak
1 | DOE/NNSA/LLNL | IBM | eServer Blue Gene Solution | United States | 2007 | 212992 | 478200 | 596378
2 | Forschungszentrum Juelich (FZJ) | IBM | Blue Gene/P Solution | Germany | 2007 | 65536 | 167300 | 222822
3 | SGI/New Mexico Computing Applications Center (NMCAC) | SGI | SGI Altix ICE 8200, Xeon quad core 3.0 GHz | United States | 2007 | 14336 | 126900 | 172032
4 | Computational Research Laboratories, TATA SONS | Hewlett-Packard | Cluster Platform 3000 BL460c, Xeon 53xx 3GHz, Infiniband | India | 2007 | 14240 | 117900 | 170880
5 | Government Agency | Hewlett-Packard | Cluster Platform 3000 BL460c, Xeon 53xx 2.66GHz, Infiniband | Sweden | 2007 | 13728 | 102800 | 146430
6 | NNSA/Sandia National Laboratories | Cray Inc. | Sandia/Cray Red Storm, Opteron 2.4 GHz dual core | United States | 2007 | 26569 | 102200 | 127531
7 | Oak Ridge National Laboratory | Cray Inc. | Cray XT4/XT3 | United States | 2006 | 23016 | 101700 | 119350
8 | IBM Thomas J. Watson Research Center | IBM | eServer Blue Gene Solution | United States | 2005 | 40960 | 91290 | 114688
9 | NERSC/LBNL | Cray Inc. | Cray XT4, 2.6 GHz | United States | 2007 | 19320 | 85368 | 100464
10 | Stony Brook/BNL, New York Center for Computational Sciences | IBM | eServer Blue Gene Solution | United States | 2007 | 36864 | 82161 | 103219
11 | DOE/NNSA/LLNL | IBM | eServer pSeries p5 575 1.9 GHz | United States | 2006 | 12208 | 75760 | 92781
12 | Rensselaer Polytechnic Institute, Computational Center for Nanotechnology Innovations | IBM | eServer Blue Gene Solution | United States | 2007 | 32768 | 73032 | 91750
13 | Barcelona Supercomputing Center | IBM | BladeCenter JS21 Cluster, PPC 970, 2.3 GHz, Myrinet | Spain | 2006 | 10240 | 63830 | 94208
14 | NCSA | Dell | PowerEdge 1955, 2.33 GHz, Infiniband | United States | 2007 | 9600 | 62680 | 89587.2
15 | Leibniz Rechenzentrum | SGI | Altix 4700 1.6 GHz | Germany | 2007 | 9728 | 56520 | 62259.2
16 | GSIC Center, Tokyo Institute of Technology | NEC/Sun | Sun Fire x4600 Cluster, Opteron 2.4/2.6 GHz and ClearSpeed Accelerator | Japan | 2007 | 11664 | 56430 | 102021
17 | University of Edinburgh | Cray Inc. | Cray XT4, 2.8 GHz | United Kingdom | 2007 | 11328 | 54648 | 63436.8
18 | NNSA/Sandia National Laboratories | Dell | PowerEdge 1850, 3.6 GHz, Infiniband | United States | 2006 | 9024 | 53000 | 64972.8
19 | Commissariat a l'Energie Atomique (CEA) | Bull SA | NovaScale 5160, Itanium2 1.6 GHz, Quadrics | France | 2006 | 9968 | 52840 | 63795.2
20 | NASA/Ames Research Center/NAS | SGI | SGI Altix 1.5 GHz, Voltaire Infiniband | United States | 2004 | 10160 | 51870 | 60960
16
Number of Processors
17
Fastest Supercomputers
www.top500.org
Chart: performance of the fastest supercomputers over time, at present and projected, with the Japanese Earth Simulator and "My Laptop" marked for comparison.
18
A Growth-Factor of a Billion
in Performance in a Career
Chart: performance on a log scale from 1 Flop/s through 1 KFlop/s (10^3), 1 MFlop/s (10^6), 1 GFlop/s (10^9), and 1 TFlop/s (10^12) to 1 PFlop/s (10^15), plotted against years from 1941 to 2010 and spanning the scalar, super scalar, vector, parallel, and super scalar/vector/parallel eras. Machines along the curve include EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, and IBM BG/L at 131 TFlop/s (131,000,000,000,000 Flop/s). Transistors per chip double every 1.5 years.
Japanese “Life Simulator” Effort for a
10 Pflop/s System
 From the Nikkei newspaper, May
30th morning edition.
 Collaboration of industry, academia
and government is organized by
NEC, Hitachi, U of Tokyo, Kyusyu U,
and RIKEN.
 Competition component similar to
the DARPA HPCS program.
 This year allocated about $4 M each
to do advanced development
towards petascale.
 Total of ¥100,000 M ($909 M) will be
invested in this development.
 Plan to be operational in 2011.
Japan’s Life Simulator:
Original concept design in 2005
Diagram: the needs of multiscale multiphysics simulation call for multiple computation components. At present, vector, scalar, and MD nodes each sit on their own faster interconnect but are joined through a switch over a slower connection. The proposed architecture integrates the multiple architectures into a tightly-coupled heterogeneous computer, with vector, scalar, MD, and FPGA nodes on a single faster interconnect.
Major Applications of Next Generation Supercomputer
Targeted as grand challenges:
Basic Concept for Simulations in Nano-Science
Basic Concept for Simulations in Life Sciences
Diagram labels include the micro, meso, and macro scales: genome, genes, gene therapy, proteins, bio-MD, chemical processes, DDS, cells, tissue, tissue structure, micro-machines, HIFU, blood circulation, vascular system, organs, organisms, multi-physics, catheter. (RIKEN; image sources: http://info.med.vale.edu/, http://ridge.icu.ac.jp)
Petascale Era: 2008-
•NCSA: Blue Waters, 1 PFlop/s, 2011
25
Bell versus Moore
26
Grand Challenge Applications
27
The von Neumann Computer
Walk-Through: c=a+b
1. Get next instruction
2. Decode: Fetch a
3. Fetch a to internal register
4. Get next instruction
5. Decode: fetch b
6. Fetch b to internal register
7. Get next instruction
8. Decode: add a and b (c in register)
9. Do the addition in ALU
10. Get next instruction
11. Decode: store c in main memory
12. Move c from internal register to main memory
Note: Some units are idle while others are working…waste of cycles.
Pipelining (modularization) & Caching (advance decoding)…parallelism
28
Basic Architecture
-CPU, pipelining
-Memory hierarchy,
cache
29
Computer Performance
CPU operates on data. If no data, CPU has to
wait; performance degrades.
 typical workstation: 3.2GHz CPU, Memory 667MHz.
Memory 5 times slower.
Moore’s law: CPU speed doubles every 18 months
Memory speed increases much much slower;
Fast CPU requires sufficiently fast memory.
Rule of thumb: Memory size in GB=R_theor in
GFLOPS
1CPU cycle (1 FLOPS) handles 1 byte of data
1MFLOPS needs 1MB of data/memory
1GFLOPS needs 1GB of data/memory
Many “tricks” designed for performance improvement target the memory
30
CPU Performance
Computer time is measured in terms of CPU
cycles
Minimum time to execute 1 instruction is 1 CPU cycle
Time to execute a given program:
T = n_c * t_c = n_i * (n_c/n_i) * t_c = n_i * CPI * t_c
n_c: total number of CPU cycles
n_i: total number of instructions
CPI = n_c/n_i, average cycles per instruction
t_c: cycle time, 1GHz -> t_c = 1/(10^9 Hz) = 10^(-9) sec = 1 ns
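A minimal C sketch of this formula, with illustrative (made-up) values for n_i and CPI:

/* Sketch: T = n_i * CPI * t_c. */
#include <stdio.h>

int main(void) {
    double n_i = 1.0e9;    /* total instructions (illustrative) */
    double cpi = 1.4;      /* average cycles per instruction    */
    double t_c = 1.0e-9;   /* cycle time: 1 GHz -> 1 ns         */
    printf("T = %.2f seconds\n", n_i * cpi * t_c);
    return 0;
}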
31
To Make a Program/Computer Faster…
 Reduce cycle time t_c:
 Increase clock frequency; however, there is a physical limit
 In 1ns, light travels 30cm
Currently ~GHz; at 3GHz, light travels 10cm within 1 CPU
cycle -> length/size must be < 10cm.
 1 atom about 0.2 nm;
 Reduce number of instructions n_i:
 More efficient algorithms
 Better compilers
 Reduce CPI -- The key is parallelism.
 Instruction-level parallelism. Pipelining technology
 Internal parallelism, multiple functional units; superscalar
processors; multi-core processors
 External parallelism, multiple CPUs, parallel machine
32
Processor Types
Vector processor;
Cray X1/T90; NEC SX#; Japan Earth Simulator; Early
Cray machines; Japan Life Simulator (hybrid)
Scalar processor
CISC: Complex Instruction Set Computer
• Intel 80x86 (IA32)
RISC: Reduced Instruction Set Computer
• Sun SPARC, IBM Power #, SGI MIPS
VLIW: Very Long Instruction Word; Explicitly parallel
instruction computing (EPIC); Probably dying
• Intel IA64 (Itanium)
33
CISC Processor
CISC
Complex instructions; Large number of
instructions; Can complete more complicated
functions at instruction level
Instruction actually invokes microcode.
Microcodes are small programs in processor
memory
Slower; many instructions access memory;
varying instruction length; does not allow pipelining;
34
RISC Processor
No microcode
Simple instructions; Fewer instructions;
Fast
Only load and store instructions access
memory
Common instruction word length
Allows pipelining
Almost all present-day high performance computers use
RISC processors
35
Locality of References
Spatial/Temporal locality
If processor executes an instruction at time t,
it is likely to execute an adjacent/next
instruction at (t+delta_t);
If processor accesses a memory location/data
item x at time t, it is likely to access an
adjacent memory location/data item
(x+delta_x) at (t+delta_t);
Pipelining, Caching and many other techniques all
based on the locality of references
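A minimal C illustration of spatial locality (C arrays are stored row by row): the first loop nest touches memory contiguously and reuses each cache line, while the second jumps a whole row per access.

/* Spatial locality: row-major (cache-friendly) vs. column-major (stride-N) traversal. */
#define N 1024
static double a[N][N];

double sum_row_major(void) {    /* a[i][0], a[i][1], ... are adjacent in memory       */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

double sum_col_major(void) {    /* stride of N doubles -> one cache line per access   */
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}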
36
Pipelining
Overlapping execution of multiple instructions
1 instruction per cycle
Sub-divide instruction into multiple stages;
Processor handles different stages of adjacent
instructions simultaneously
Suppose 4 stages in instruction:
Instruction fetch and decode (IF)
Read data (RD)
Execute (EX)
Write-back results (WB)
37
Instruction Pipeline
cycle:           1   2   3   4   5   6   7   8   9   10
instruction 1:   IF  RD  EX  WB
instruction 2:       IF  RD  EX  WB
instruction 3:           IF  RD  EX  WB
instruction 4:               IF  RD  EX  WB
instruction 5:                   IF  RD  EX  WB
instruction 6:                       IF  RD  EX  WB
instruction 7:                           IF  RD  EX  WB
Depth of pipeline: number of stages in an instruction
After the pipeline is full, 1 result per cycle! CPI = (n+depth-1)/n
With pipeline, 7 instructions take 10 cycles. Without a pipeline, 7 instructions take 28 cycles.
38
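A small C sketch of the cycle-count formula above, with depth 4 (IF, RD, EX, WB) and 7 instructions as in the diagram:

/* Sketch: cycles and CPI for an n-instruction run through a pipeline of given depth. */
#include <stdio.h>

int main(void) {
    int depth  = 4;                     /* IF, RD, EX, WB              */
    int n      = 7;
    int cycles = n + depth - 1;         /* 10 cycles with the pipeline */
    printf("pipelined:   %d cycles, CPI = %.2f\n", cycles, (double)cycles / n);
    printf("unpipelined: %d cycles\n", n * depth);   /* 28 cycles      */
    return 0;
}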
Inhibitors of Pipelining
Dependencies between instructions interrupt pipelining, degrading performance
Control dependence.
Data dependence.
39
Control Dependence
Branching: when an instruction comes after a
conditional branch, it is not known beforehand
whether that instruction will be executed
Loop: for(i=0;i<n;i++)…; do…enddo
Jump: goto …
Condition: if…else…
if(x>y) n=5;
Branching in programs interrupts the pipeline -> degrades performance
Avoid excessive branching!
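One common way to reduce branching inside hot loops is to hoist a loop-invariant condition out of the loop; a minimal sketch (the function names and the flag are just for illustration):

/* Branch tested on every iteration. */
void scale_branchy(double *x, int n, int double_it) {
    for (int i = 0; i < n; i++) {
        if (double_it) x[i] *= 2.0;
        else           x[i] *= 0.5;
    }
}

/* Loop-invariant branch hoisted: tested once, loop body is branch-free. */
void scale_hoisted(double *x, int n, int double_it) {
    double factor = double_it ? 2.0 : 0.5;
    for (int i = 0; i < n; i++)
        x[i] *= factor;
}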
40
Data Dependence
when an instruction depends on data from a
previous instruction
x = 3*j;
y = x+5.0; // depends on previous instruction
41
Vector Pipeline
Vector processors: with vector registers which
can hold a vector, e.g. of 128 elements;
Most commonly encountered processors are scalar
processors, e.g. in home PCs
Efficient for loops involving vectors.
Instructions:
for (i = 0; i < 128; i++)
    z[i] = x[i] + y[i];
Vector Load X(1:128)
Vector Load Y(1:128)
Vector Add Z=X+Y
Vector Store Z
42
Vector Pipeline
Pipeline timing: the four vector instructions are fetched and issued back to back; elements X(1)...X(128) and Y(1)...Y(128) stream through the load pipes, the add produces Z(1), ..., Z(128) one result per cycle once the pipeline is full, and the store writes Z(1), ..., Z(128). The whole 128-element operation completes in roughly 133 cycles.
43
Vector Operations: Hockney’s Formulas
CACHE: 64 Kb
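For reference, a standard statement of Hockney's two-parameter model for a pipelined vector operation of length n (the measured curve on this slide is not reproduced here):

t(n) = (n + n_1/2) / r_inf        r(n) = n / t(n) = r_inf / (1 + n_1/2 / n)

where r_inf is the asymptotic rate for very long vectors and n_1/2 is the vector length at which half of r_inf is achieved.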
44
Exceeding Cache Size
CACHE: 32 Kb
Cache line: 64 bytes
NOTE: Asymptotic 5Mflops: result every 15 clocks –
time to reload a cache line following a miss
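A minimal C benchmark sketch of this effect; the array sizes, the repetition scheme, and the constants in the update are illustrative, but the measured MFLOPS should drop once the working set no longer fits in cache:

/* Sketch: MFLOPS of a simple update as the working set grows past the cache size. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    for (long n = 1024; n <= (1L << 22); n *= 2) {   /* 8 KB ... 32 MB of doubles */
        double *x = malloc(n * sizeof(double));
        if (!x) return 1;
        for (long i = 0; i < n; i++) x[i] = 1.0;

        long reps = 100000000L / n + 1;              /* keep total work roughly constant */
        clock_t t0 = clock();
        for (long r = 0; r < reps; r++)
            for (long i = 0; i < n; i++)
                x[i] = x[i] * 1.000001 + 0.000001;   /* 2 flops per element per pass */
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        if (x[n / 2] < 0.0) printf("unexpected\n");  /* keep the loop from being optimized away */
        printf("%8ld doubles (%8.0f KB): %8.1f MFLOPS\n",
               n, n * sizeof(double) / 1024.0, 2.0 * n * reps / secs / 1e6);
        free(x);
    }
    return 0;
}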
45
Internal Parallelism
Functional units:
components in
processor that
actually do the work
Memory operations
(MU): load, store;
Integer arithmetic (IU):
integer add, bit shift …
Floating point
arithmetic (FPU):
floating-point add,
multiply …
Typical instruction latencies:
Instruction type          Latency (cycles)
Integer add               1
Floating-point add        3
Floating-point multiply   3
Floating-point divide     31
Division is much slower than add/multiply! Minimize
or avoid divisions!
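A common way to act on this advice: when many values are divided by the same number, compute its reciprocal once and multiply inside the loop. A minimal sketch (note the two versions may differ in the last bit of rounding, which is why compilers only do this automatically under relaxed floating-point flags):

/* Dividing n values by the same d: a divide per element vs. one reciprocal. */
void scale_div(double *x, int n, double d) {
    for (int i = 0; i < n; i++)
        x[i] = x[i] / d;       /* one high-latency divide per element */
}

void scale_recip(double *x, int n, double d) {
    double r = 1.0 / d;        /* a single divide                     */
    for (int i = 0; i < n; i++)
        x[i] = x[i] * r;       /* cheap multiplies in the loop        */
}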
46
Internal Parallelism
Superscalar RISC processors: multiple
functional units in processor, e.g. multiple
FPUs,
Capable of executing more than one
instruction (producing more than one result)
per cycle.
Shared registers, L1 cache etc.
Need faster memory access to provide
data to multiple functional units!
Limiting factor: memory-processor
bandwidth
47
Internal Parallelism
Multi-core processors: Intel dual-core, quad-core
Multiple execution cores (each with its own functional units, registers, L1 cache)
Multiple cores share the L2 cache and memory
Lower energy consumption
Need FAST memory access to provide data to multiple cores
Effective memory bandwidth per core is reduced
Limiting factor: memory-processor bandwidth
Diagram: a CPU chip with several cores (functional units + L1 cache) and an L2 cache shared between the cores.
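A back-of-the-envelope sketch of the bandwidth point; the 10 GB/s chip-to-memory figure is illustrative, not the spec of any particular processor:

/* Sketch: effective memory bandwidth per core as the core count grows. */
#include <stdio.h>

int main(void) {
    double chip_bw = 10.0;   /* GB/s to memory, shared by all cores (illustrative) */
    for (int cores = 1; cores <= 8; cores *= 2)
        printf("%d core(s): %.2f GB/s per core\n", cores, chip_bw / cores);
    return 0;
}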
48
Heat Flux also Increases with Speed!
49
New Processors are Too Hot!
50
51
Your Next PC?
52
External Parallelism
Parallel machines: Will be discussed later
53
Memory: Next Lecture
 Bit: 0, 1; Byte: 8 bits
 Memory size
 PB – 10^15 bytes; TB – 10^12 bytes; GB – 10^9 bytes; MB –
10^6 bytes
 Memory performance measures:
 Access time, or response time, latency: interval between time of
issuance of memory request and time when request is satisfied.
 Cycle time: minimum time between two successive memory
requests
Access time: t1 - t0
Cycle time: t2 - t0
Memory is busy for t0 < t < t2
Timeline: a memory request issued at t0 is satisfied at t1; the memory remains busy until t2. If another request arrives at t0 < t < t2, the memory is busy and will not respond; the request has to wait until t > t2.
54
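A small C sketch of these two definitions with made-up numbers (60 ns access time, 100 ns cycle time):

/* Sketch: back-to-back memory requests, given access time and cycle time. */
#include <stdio.h>

int main(void) {
    double access_ns = 60.0;    /* request issued at t0 is satisfied at t0 + 60 ns       */
    double cycle_ns  = 100.0;   /* memory cannot accept a new request before t0 + 100 ns */

    double t0 = 0.0;
    double t1 = t0 + access_ns;            /* first request satisfied                     */
    double t2 = t0 + cycle_ns;             /* memory free again                           */
    double second_done = t2 + access_ns;   /* a request arriving early must wait until t2 */

    printf("request 1: issued at %.0f ns, satisfied at %.0f ns\n", t0, t1);
    printf("request 2: earliest start %.0f ns, satisfied at %.0f ns\n", t2, second_done);
    return 0;
}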