High Performance Computing Lecture 1
Parallel Scientific Computing:
Algorithms and Tools
Lecture #1
APMA 2821A, Spring 2008
Instructors: George Em Karniadakis
Leopold Grinberg
1
Logistics
Contact:
Office hours: GK: M 2-4 pm; LG: W 2-4 pm
Email: {gk,lgrinb}@dam.brown.edu
Web: www.cfm.brown.edu/people/gk/APMA2821A
Textbook:
Karniadakis & Kirby, “Parallel scientific computing in C++/MPI”
Other books:
Shonkwiler & Lefton, “Parallel and Vector Scientific Computing”
Wadleigh & Crawford, “Software Optimization for High Performance
Computing”
Foster, “Designing and Building Parallel Programs” (available online)
2
Logistics
CCV Accounts
Email: [email protected]
Prerequisite: C/Fortran programming
Grading:
5 assignments/mini-projects: 50%
1 Final project/presentation: 50%
3
History
4
History
5
Course Objectives
Understanding of fundamental concepts
and programming principles for
development of high performance
applications
Able to program a range of parallel
computers: from PC clusters to supercomputers
Make efficient use of high performance
parallel computing in your own research
6
Course Objectives
7
Content Overview
Parallel computer architecture: 2-3 weeks
CPU, Memory; Shared-/distributed-memory parallel
machines; network connections;
Parallel programming: 5 weeks
MPI; OpenMP; UPC
Parallel numerical algorithms: 4 weeks
Matrix algorithms; direct/iterative solvers;
eigensolvers; Monte Carlo methods (simulated
annealing, genetic algorithms)
Grid computing: 1 week
Globus, MPICH-G2
8
What & Why
What is high performance computing (HPC)?
The use of the most efficient algorithms on computers capable of
the highest performance to solve the most demanding problems.
Why HPC?
Large problems – spatially/temporally
• 10,000 x 10,000 x 10,000 grid → 10^12 grid points → 4x10^12
double variables → 32x10^12 bytes = 32 Tera-Bytes.
• Usually need to simulate tens of millions of time steps.
• On-demand/urgent computing; real-time computing;
Weather forecasting; protein folding; turbulence
simulations/CFD; aerospace structures; Full-body simulation/
Digital human …
9
HPC Examples: Blood Flow in
Human Vascular Network
Cardiovascular disease accounts for
about 50% of deaths in the western world;
Formation of arterial disease strongly
correlated to blood flow patterns;
In one minute, the heart pumps the
entire blood supply of 5 quarts
through 60,000 miles of vessels, about
a quarter of the distance between
the earth and the moon
Blood flow involves multiple scales
Computational challenges:
Enormous problem size
10
HPC Examples
Earthquake simulation
Surface velocity 75 sec after
earthquake
Flu pandemic simulation
300 million people tracked
Density of infected population,
45 days after breakout
11
HPC Example: Homogeneous Turbulence
[Figure: vorticity isosurface, with successive zoom-in panels]
Direct Numerical Simulation of Homogeneous Turbulence: 4096^3
12
How HPC fits into Scientific Computing
Physical Processes: e.g. air flow around an airplane
Mathematical Models: e.g. Navier-Stokes equations
Numerical Solutions: algorithms, BCs, solvers (HPC)
Application codes, supercomputers (HPC)
Data Visualization, Validation, Physical insight: viz software
13
Performance Metrics
FLOPS, or FLOP/s: FLoating-point Operations Per
Second
MFLOPS: MegaFLOPS, 10^6 FLOPS
GFLOPS: GigaFLOPS, 10^9 FLOPS, home PC
TFLOPS: TeraFLOPS, 10^12 FLOPS, present-day
supercomputers (www.top500.org)
PFLOPS: PetaFLOPS, 10^15 FLOPS, by 2011
EFLOPS: ExaFLOPS, 10^18 FLOPS, by 2020
MIPS = Million Instructions Per Second = MegaHertz (if 1 instruction per cycle)
Note: von Neumann computer -- 0.00083 MIPS
14
Performance Metrics
Theoretical peak performance R_theor:
maximum FLOPS a machine can reach in
theory.
R_theor = clock_rate * no_CPUs * no_FPUs/CPU
Example: 3 GHz, 2 CPUs, 1 FPU/CPU → R_theor = 3x10^9 * 2 * 1 =
6 GFLOPS
Real performance R_real: FLOPS for specific
operations, e.g. vector multiplication
Sustained performance R_sustained:
performance on an application, e.g. CFD
R_sustained << R_real << R_theor
It is not uncommon that
R_sustained < 10% of R_theor
15
Top 10 Supercomputers
www.top500.org
November 2007, LINPACK performance
(Rank. Site, Country -- Computer; Year; Processors; Rmax; Rpeak, in GFLOPS)

1. DOE/NNSA/LLNL, United States -- IBM eServer Blue Gene Solution; 2007; 212992; 478200; 596378
2. Forschungszentrum Juelich (FZJ), Germany -- IBM Blue Gene/P Solution; 2007; 65536; 167300; 222822
3. SGI/New Mexico Computing Applications Center (NMCAC), United States -- SGI Altix ICE 8200, Xeon quad core 3.0 GHz; 2007; 14336; 126900; 172032
4. Computational Research Laboratories, TATA SONS, India -- Hewlett-Packard Cluster Platform 3000 BL460c, Xeon 53xx 3 GHz, Infiniband; 2007; 14240; 117900; 170880
5. Government Agency, Sweden -- Hewlett-Packard Cluster Platform 3000 BL460c, Xeon 53xx 2.66 GHz, Infiniband; 2007; 13728; 102800; 146430
6. NNSA/Sandia National Laboratories, United States -- Cray Inc. Sandia/Cray Red Storm, Opteron 2.4 GHz dual core; 2007; 26569; 102200; 127531
7. Oak Ridge National Laboratory, United States -- Cray Inc. Cray XT4/XT3; 2006; 23016; 101700; 119350
8. IBM Thomas J. Watson Research Center, United States -- IBM eServer Blue Gene Solution; 2005; 40960; 91290; 114688
9. NERSC/LBNL, United States -- Cray Inc. Cray XT4, 2.6 GHz; 2007; 19320; 85368; 100464
10. Stony Brook/BNL, New York Center for Computational Sciences, United States -- IBM eServer Blue Gene Solution; 2007; 36864; 82161; 103219
11. DOE/NNSA/LLNL, United States -- IBM eServer pSeries p5 575 1.9 GHz; 2006; 12208; 75760; 92781
12. Rensselaer Polytechnic Institute, Computational Center for Nanotechnology Innovations, United States -- IBM eServer Blue Gene Solution; 2007; 32768; 73032; 91750
13. Barcelona Supercomputing Center, Spain -- IBM BladeCenter JS21 Cluster, PPC 970 2.3 GHz, Myrinet; 2006; 10240; 63830; 94208
14. NCSA, United States -- Dell PowerEdge 1955, 2.33 GHz, Infiniband; 2007; 9600; 62680; 89587.2
15. Leibniz Rechenzentrum, Germany -- SGI Altix 4700 1.6 GHz; 2007; 9728; 56520; 62259.2
16. GSIC Center, Tokyo Institute of Technology, Japan -- NEC/Sun Sun Fire x4600 Cluster, Opteron 2.4/2.6 GHz and ClearSpeed Accelerator; 2007; 11664; 56430; 102021
17. University of Edinburgh, United Kingdom -- Cray Inc. Cray XT4, 2.8 GHz; 2007; 11328; 54648; 63436.8
18. NNSA/Sandia National Laboratories, United States -- Dell PowerEdge 1850, 3.6 GHz, Infiniband; 2006; 9024; 53000; 64972.8
19. Commissariat a l'Energie Atomique (CEA), France -- Bull SA NovaScale 5160, Itanium2 1.6 GHz, Quadrics; 2006; 9968; 52840; 63795.2
20. NASA/Ames Research Center/NAS, United States -- SGI Altix 1.5 GHz, Voltaire Infiniband; 2004; 10160; 51870; 60960
[Plot: R_real vs R_theor]
16
[Plot: performance vs number of processors]
17
Fastest
Supercomputers
www.top500.org
[Plot: Top500 performance at present and projections, marking the Japanese Earth Simulator and "My Laptop"]
18
A Growth-Factor of a Billion
in Performance in a Career
[Chart: peak performance grows from 1 KFlop/s (10^3; EDSAC 1, UNIVAC 1, ~1950) through 1 MFlop/s (10^6; IBM 7090, CDC 6600, IBM 360/195, CDC 7600), 1 GFlop/s (10^9; Cray 1, Cray X-MP, Cray 2, TMC CM-2), 1 TFlop/s (10^12; TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific), to 131 TFlop/s (IBM BG/L, 2005), approaching 1 PFlop/s (10^15). Eras: Scalar → Super Scalar → Vector → Parallel → Super Scalar/Vector/Parallel. Transistors per chip double every 1.5 years.]
Japanese “Life Simulator” Effort for a
10 Pflop/s System
From the Nikkei newspaper, May
30th morning edition.
Collaboration of industry, academia
and government, organized by
NEC, Hitachi, U of Tokyo, Kyushu U,
and RIKEN.
Competition component similar to
the DARPA HPCS program.
This year, about $4 M was allocated to each
to do advanced development
towards petascale.
Total of ¥100,000 M ($909 M) will be
invested in this development.
Plan to be operational in 2011.
Japan’s Life Simulator:
Original concept design in 2005
Needs of multiscale multiphysics simulation
Present: needs of multiple computation components; vector nodes, scalar nodes, and MD nodes, each with its own faster interconnect, are connected through a switch over a slower connection.
Proposing: integration of multiple architectures into a tightly-coupled heterogeneous computer; vector, scalar, MD, and FPGA nodes share one faster interconnect.
Major Applications of Next Generation Supercomputer
Targeted as grand
challenges
Basic Concept for Simulations in Nano-Science
Basic Concept for Simulations in Life Sciences
[Diagram: scales from Micro (genes, genome, protein, bio-MD, chemical process, gene therapy, DDS) through Meso (cell, tissue, micro-machine, HIFU, catheter) to Macro (tissue structure, organ, organism, blood circulation, vascular system, multi-physics)]
Petascale Era: 2008-
• NCSA: Blue Waters, 1 PFlop/s, 2011
25
Bell versus Moore
26
Grand Challenge Applications
27
The von Neumann Computer
Walk-Through: c=a+b
1. Get next instruction
2. Decode: Fetch a
3. Fetch a to internal register
4. Get next instruction
5. Decode: fetch b
6. Fetch b to internal register
7. Get next instruction
8. Decode: add a and b (c in register)
9. Do the addition in ALU
10. Get next instruction
11. Decode: store c in main memory
12. Move c from internal register to main memory
Note: Some units are idle while others are working…waste of cycles.
Pipelining (modularization) & Caching (advance decoding)…parallelism
28
Basic Architecture
-CPU, pipelining
-Memory hierarchy,
cache
29
Computer Performance
CPU operates on data. If no data, CPU has to
wait; performance degrades.
Typical workstation: 3.2 GHz CPU, 667 MHz memory.
The memory is about 5 times slower.
Moore’s law: CPU speed doubles every 18 months
Memory speed increases much much slower;
Fast CPU requires sufficiently fast memory.
Rule of thumb: Memory size in GB=R_theor in
GFLOPS
1 CPU cycle (1 FLOP) handles 1 byte of data
1 MFLOPS needs 1 MB of data/memory
1 GFLOPS needs 1 GB of data/memory
Many “tricks” designed for performance improvement target the memory
30
CPU Performance
Computer time is measured in terms of CPU
cycles
Minimum time to execute 1 instruction is 1 CPU cycle
Time to execute a given program:
T = n_c * t_c = n_i * (n_c/n_i) * t_c = n_i * CPI * t_c
n_c: total number of CPU cycles
n_i: total number of instructions
CPI = n_c/n_i, average cycles per instruction
t_c: cycle time; 1 GHz → t_c = 1/(10^9 Hz) = 10^(-9) sec = 1 ns
31
To Make a Program/Computer Faster…
Reduce cycle time t_c:
Increase clock frequency; however, there is a physical limit:
In 1 ns, light travels 30 cm
Currently ~GHz; for a 3 GHz CPU, light travels 10 cm within 1 CPU
cycle → length/size must be < 10 cm.
1 atom is about 0.2 nm;
Reduce number of instructions n_i:
More efficient algorithms
Better compilers
Reduce CPI -- The key is parallelism.
Instruction-level parallelism. Pipelining technology
Internal parallelism, multiple functional units; superscalar
processors; multi-core processors
External parallelism, multiple CPUs, parallel machine
32
Processor Types
Vector processor;
Cray X1/T90; NEC SX#; Japan Earth Simulator; Early
Cray machines; Japan Life Simulator (hybrid)
Scalar processor
CISC: Complex Instruction Set Computer
• Intel 80x86 (IA32)
RISC: Reduced Instruction Set Computer
• Sun SPARC, IBM Power #, SGI MIPS
VLIW: Very Long Instruction Word; Explicitly parallel
instruction computing (EPIC); Probably dying
• Intel IA64 (Itanium)
33
CISC Processor
CISC
Complex instructions; large number of
instructions; can complete more complicated
functions at the instruction level
An instruction actually invokes microcode.
Microcodes are small programs in processor
memory
Slower; many instructions access memory;
varying instruction lengths allow no pipelining;
34
RISC Processor
No microcode
Simple instructions; Fewer instructions;
Fast
Only load and store instructions access
memory
Common instruction word length
Allows pipelining
Almost all present-day high performance computers use
RISC processors
35
Locality of References
Spatial/Temporal locality
If processor executes an instruction at time t,
it is likely to execute an adjacent/next
instruction at (t+delta_t);
If processor accesses a memory location/data
item x at time t, it is likely to access an
adjacent memory location/data item
(x+delta_x) at (t+delta_t);
Pipelining, caching, and many other techniques are all
based on the locality of references
36
Pipelining
Overlapping execution of multiple instructions
→ up to 1 instruction completed per cycle
Sub-divide instruction into multiple stages;
Processor handles different stages of adjacent
instructions simultaneously
Suppose 4 stages in instruction:
Instruction fetch and decode (IF)
Read data (RD)
Execute (EX)
Write-back results (WB)
37
Instruction Pipeline
instruction
1
IF
2
3
4
5
6
7
cycle
1
RD
IF
2
EX
RD
IF
3
WB
EX
RD
IF
4
WB
EX
RD
IF
5
WB
EX
RD
IF
6
WB
EX
RD
IF
WB
EX
RD
WB
EX
7
8
9
WB
10
Depth of pipeline: number of stages in an instruction
After the pipeline is full, 1 result per cycle! CPI = (n+depth-1)/n
With pipeline, 7 instructions take 10 cycles. If no pipeline, 7 instructions take 28
38
cycles
Inhibitors of Pipelining
Dependencies between instructions
interrupts pipelining, degrading
performance
Control dependence.
Data dependence.
39
Control Dependence
Branching: an instruction occurs after a
conditional branch, so it is unknown beforehand whether that
instruction will be executed
Loop: for(i=0;i<n;i++)…; do…enddo
Jump: goto …
Condition: if…else…
if(x>y) n=5;
Branching in programs interrupts the pipeline → degrades performance
Avoid excessive branching!
40
Data Dependence
when an instruction depends on data from a
previous instruction
x = 3*j;
y = x+5.0; // depends on previous instruction
41
Vector Pipeline
Vector processors: have vector registers, each of which
can hold an entire vector, e.g. of 128 elements;
most commonly encountered processors are scalar
processors, e.g. home PC
Efficient for loops involving vectors.
Instructions:
for (i=0;i<128;i++)
z[i] = x[i] + y[i]
Vector Load X(1:128)
Vector Load Y(1:128)
Vector Add Z=X+Y
Vector Store Z
42
Vector Pipeline
instruction \ cycle   1   2   3   ...                        133
Load X(1:128)         IF  RD  X(1) ... X(128)
Load Y(1:128)             IF  RD  Y(1) ... Y(128)
Add Z=X+Y                     IF  AD  Z(1) ... Z(128)
Store Z                           IF  ST  Z(1) ... Z(128)
time →
43
Vector Operations: Hockney’s Formulas
CACHE: 64 Kb
44
Exceeding Cache Size
CACHE: 32 Kb
Cache line: 64 bytes
NOTE: Asymptotic 5Mflops: result every 15 clocks –
time to reload a cache line following a miss
45
Internal Parallelism
Functional units:
components in
processor that
actually do the work
Memory operations
(MU): load, store;
Integer arithmetic (IU):
integer add, bit shift …
Floating point
arithmetic (FPU):
floating-point add,
multiply …
Typical instruction latencies:

Instruction type          Latency (cycles)
Integer add               1
Floating-point add        3
Floating-point multiply   3
Floating-point divide     31
Division is much slower than add/multiply! Minimize
or avoid divisions!
46
Internal Parallelism
Superscalar RISC processors: multiple
functional units in processor, e.g. multiple
FPUs,
Capable of executing more than one
instruction (producing more than one result)
per cycle.
Shared registers, L1 cache etc.
Need faster memory access to provide
data to multiple functional units!
Limiting factor: memory-processor
bandwidth
47
Internal Parallelism
CPU chip
Multi-core processors: Intel
dual-core, quad-core
Multiple execution cores
(functional units, registers, L1
cache)
Multiple cores share L2 cache,
memory
Lower energy consumption
Need FAST memory access
to provide data to multiple
cores
Effective memory bandwidth per
core is reduced
Limiting factor: memory-processor bandwidth
[Diagram: CPU chip; each core has its own functional units + L1 cache; the L2 cache is shared between cores]
48
Heat Flux also Increases with Speed!
49
New Processors are Too Hot!
50
51
Your Next PC?
52
External Parallelism
Parallel machines: Will be discussed later
53
Memory: Next Lecture
Bit: 0, 1; Byte: 8 bits
Memory size
PB – 10^15 bytes; TB – 10^12 bytes; GB – 10^9 bytes; MB –
10^6 bytes
Memory performance measures:
Access time, or response time, latency: interval between time of
issuance of memory request and time when request is satisfied.
Cycle time: minimum time between two successive memory
requests
Access time: t1 - t0
Cycle time: t2 - t0
Memory is busy for t0 < t < t2
Timeline: memory request issued at t0; request satisfied at t1; memory ready again at t2.
If there is another request at t0 < t < t2, the memory is busy and will not respond; the request has to wait until t > t2
54