SuperNetworking - IEEE Computer Society

Re-Configurable EXASCALE Computing
Steve Wallach
swallach "at" conveycomputer "dot" com
Convey <= Convex++
Acknowledgement
• Steve Poole (ORNL) and I exchanged many emails and had many discussions on technology trends.
• We are both guilty of being overzealous architects.
Discussion
• The state of HPC software – today
• The path to Exascale Computing
• "If you don't know where you're going, any road will take you there."
  – The Wizard of Oz
• "When you get to the fork in the road, take it."
  – Yogi Berra
Mythology
• In the old days, we were told "Beware of Greeks bearing gifts"
Today
• “Beware of Geeks bearing Gifts”
What problems are we solving?
• New hardware paradigms
• Uniprocessor performance is leveling off
• MPP and multithreading for the masses
• Power consumption is a constraint
Deja Vu
• Many Core
  – ILP fizzles
  – Multi-Core evolves
• x86 extended with sse2, sse3, and sse4
  – application-specific enhancements
  – Co-processor within x86 microarchitecture
• Basically performance enhancements by
  – On-chip parallelism
  – Instructions for specific application acceleration
    • One application instruction replaces MANY generic instructions
• Déjà vu – all over again (Yogi Berra) – the 1980's
  – Need more performance than the micro provides
  – GPU, CELL, and FPGA's
    • Different software environment
    • Heterogeneous Computing AGAIN
Current Languages
• Fortran 66 → Fortran 77 → Fortran 95 → Fortran 2003
  – High Performance Fortran (HPF)
  – Co-Array Fortran
• C → C++
  – UPC
  – Stream C
  – C# (Microsoft)
  – Ct (Intel)
• OpenMP
  – Shared memory multiprocessing API
Another Bump in the Road
• GPGPU's are very cost effective for many applications.
• Matrix Multiply – Fortran

    do i = 1, n1
      do k = 1, n3
        c(i,k) = 0.0
        do j = 1, n2
          c(i,k) = c(i,k) + a(i,j) * b(j,k)
        enddo
      enddo
    enddo
PGI Fortran to CUDA
    /* Tiled 16x16 matrix multiply kernel: C = A * B */
    __global__ void
    matmulKernel( float* C, float* A, float* B, int N2, int N3 ){
        int bx = blockIdx.x, by = blockIdx.y;
        int tx = threadIdx.x, ty = threadIdx.y;
        int aFirst = 16 * by * N2;
        int bFirst = 16 * bx;
        float Csub = 0;
        for( int j = 0; j < N2; j += 16 ) {
            __shared__ float Atile[16][16], Btile[16][16];
            Atile[ty][tx] = A[aFirst + j + N2 * ty + tx];
            Btile[ty][tx] = B[bFirst + j*N3 + N3 * ty + tx];
            __syncthreads();
            for( int k = 0; k < 16; ++k )
                Csub += Atile[ty][k] * Btile[k][tx];
            __syncthreads();
        }
        int c = N3 * 16 * by + 16 * bx;
        C[c + N3 * ty + tx] = Csub;
    }

    /* Host side: allocate, copy, launch (N1 and N3 assumed multiples of 16) */
    void
    matmul( float* A, float* B, float* C,
            size_t N1, size_t N2, size_t N3 ){
        float *devA, *devB, *devC;
        cudaSetDevice(0);
        cudaMalloc( (void**)&devA, N1*N2*sizeof(float) );
        cudaMalloc( (void**)&devB, N2*N3*sizeof(float) );
        cudaMalloc( (void**)&devC, N1*N3*sizeof(float) );
        cudaMemcpy( devA, A, N1*N2*sizeof(float), cudaMemcpyHostToDevice );
        cudaMemcpy( devB, B, N2*N3*sizeof(float), cudaMemcpyHostToDevice );
        dim3 threads( 16, 16 );
        dim3 grid( N1 / threads.x, N3 / threads.y );
        matmulKernel<<< grid, threads >>>( devC, devA, devB, N2, N3 );
        cudaMemcpy( C, devC, N1*N3*sizeof(float), cudaMemcpyDeviceToHost );
        cudaFree( devA );
        cudaFree( devB );
        cudaFree( devC );
    }
"Pornographic Programming: can't define it, but you know it when you see it."
– Michael Wolfe, The Portland Group, "How We Should Program GPGPUs", November 1st, 2008
http://www.linuxjournal.com/article/10216
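Wolfe's point is that the hand-written CUDA above is a lot of machinery for a three-line loop nest; the directive-based style he advocates leaves that machinery to the compiler. A minimal sketch of the same multiply using OpenACC-style pragmas in C (OpenACC is not named on the slide, and the function name, array layout, and clauses below are illustrative assumptions):

    /* Illustrative directive-based version of the matrix multiply (C, row-major,
     * 0-based). An OpenACC-capable compiler generates the device kernel, the
     * allocations, and the host/device copies that the CUDA version spells out. */
    void matmul_acc(int n1, int n2, int n3,
                    const float *a, const float *b, float *c) {
        #pragma acc parallel loop collapse(2) \
            copyin(a[0:n1*n2], b[0:n2*n3]) copyout(c[0:n1*n3])
        for (int i = 0; i < n1; ++i) {
            for (int k = 0; k < n3; ++k) {
                float sum = 0.0f;
                #pragma acc loop reduction(+:sum)
                for (int j = 0; j < n2; ++j)
                    sum += a[(size_t)i*n2 + j] * b[(size_t)j*n3 + k];
                c[(size_t)i*n3 + k] = sum;
            }
        }
    }

Compiled without an accelerator target, the pragmas are simply ignored and the same source runs as ordinary host loops.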
Find the Accelerator
• Accelerators can be beneficial. It isn't "free" (like waiting for the next clock speed boost)
• Worst case – you will have to completely rethink your algorithms and/or data structures
• Performance tuning is still time consuming
• Don't forget our long history of parallel computing...

"One of These Things Isn't Like the Other... Now What?" – Pat McCormick, LANL
Fall Creek Falls Conference, Chattanooga, TN – Sept. 2009
Programming Model
• Tradeoff: programmer productivity vs. performance
• Web programming is mostly done with scripting and interpretive languages
  – Java
  – Javascript
• Server-side programming languages (Python, Ruby, etc.)
• Matlab users trade off productivity for performance
  – Moore's Law helps performance
  – Moore's Law hurts productivity
    • Multi-core
• What languages are being used?
  – http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
NO
– Kathy Yelick, 2008 Keynote, Salishan Conference
Exascale Workshop Dec 2009, San Diego
Exascale Specs
JULY 2011
Berkeley’s 13 Motifs
http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-23.html
What does this all mean?
How do we create architectures to do this?
Each application is a different point on this 3D grid (actually a curve).
[Figure: 3D design space – one axis runs SISD, SIMD, MIMD/Threads, Full Custom; another runs Cache-based, Stride-1 Physical, Stride-N Physical, Stride-N Smart; the vertical axis is an Application Performance Metric (e.g. Efficiency) from Low to High, with Compiler Complexity increasing across the space.]
Convey – ISA (Compiler View)
[Figure: personalities built on the x86-64 ISA – User Written; Systolic/State Machine (Bio-Informatics); Vector, 32-bit float (Seismic); Data Mining (Sorting/Tree Traversal, Bit/Logical).]
Hybrid-Core Computing
Application-Specific Personalities
• Extend the x86 instruction set
• Implement key operations in hardware
• Cache-coherent, shared memory
  – Both ISAs address common memory
[Figure: applications (Life Sciences, CAE, Financial, Oil & Gas, Custom) pass through the Convey Compilers to the x86-64 ISA plus a Custom ISA, all sharing virtual memory.]
HC-1 Hardware
• 2U enclosure:
  – Top half of the 2U platform contains the coprocessor
  – Bottom half contains the Intel motherboard
[Figure: coprocessor assembly (FSB mezzanine card) above the host x86 server assembly, with host memory DIMMs, an x16 PCI-E slot, and 3 x 3½" disk drives.]
Memory Subsystem
• Optimized for 64-bit accesses; 80 GB/sec peak
• Automatically maintains coherency without impacting AE performance
Personality Development Kit (PDK)
• Customer-designed logic in Convey infrastructure
• Executes as instructions within an x86-64 address space
• Allows designers to concentrate on Intellectual Property, not housekeeping
[Figure: PDK block diagram – the Customer Designed Logic sits between the Host Interface (HIB & IAS) with its dispatch interface, scalar instructions and exception handling, an AE-AE interface, a management/debug interface (CP management, personality loading, personality debug, HW monitor/IPMI), and the MC link interface to the memory controllers, which provide virtually addressable, cache-coherent memory.]
Current Convey Compiler Architecture
• C/C++ and Fortran95 front ends produce WHIRL, a tree-based intermediate representation
• Interprocedural Analyzer
• x86 Loop Nest Optimizer and Convey Vectorizer
  – personality-specific optimizations are built into the vectorizer, driven by a Personality Signature and a Convey Instruction Table
• Global Scalar Optimizer (woptimizer)
• Intel 64 Codegen and Convey Codegen emit object files to the linker
• AE FPGA bit file is loaded at runtime
Next Generation Optimizer
• C/C++ and Fortran95 front ends
• Interprocedural Analyzer
• x86 Loop Nest Optimizer and Convey Optimizer
  – directed acyclic graph intermediate representation
  – graph transformation rules specify operations on the graph
  – personality-specific optimizations coded as rules, driven by the Personality Signature
• Global Scalar Optimizer
• Intel 64 Codegen and Convey Codegen
  – table-based code generation from instruction descriptors
  – custom instruction definitions in descriptor files
  – object files to the linker
• AE FPGA bit file is loaded at runtime
Strided Memory Access Performance
• Convey SG-DIMMs optimized for 64-bit memory access
• High bandwidth for non-unity strides, scatter/gather
• 3131 interleave for high bandwidth for all strides except multiples of 31
• Measured with strid3.c, written by Mark Seager:

    for( i = 0; i < n*incx; i += incx ) {
        yy[i] += t*xx[i];
    }

[Figure: Strided 64-bit memory accesses, GB/sec versus stride (1-64 words), comparing a single Nehalem core against the HC-1 SG-DIMM 3131 configuration.]
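Why strides that are multiples of 31 are the exception: with a prime number of banks, almost every stride spreads its accesses across all of them. A small, purely illustrative sketch (the modulo-31 mapping below is a simplification, not Convey's documented address hash):

    #include <stdio.h>

    /* Hypothetical illustration: with a prime number of banks (31),
     * consecutive elements of almost any stride spread across all banks.
     * Only strides that are multiples of 31 keep hitting the same bank. */
    #define NBANKS 31

    int main(void) {
        int strides[] = {1, 2, 8, 16, 31, 62};
        for (unsigned s = 0; s < sizeof strides / sizeof strides[0]; ++s) {
            int hit[NBANKS] = {0};
            for (long i = 0; i < 1024; ++i)          /* 1024 strided accesses */
                hit[(i * strides[s]) % NBANKS] = 1;
            int used = 0;
            for (int b = 0; b < NBANKS; ++b) used += hit[b];
            printf("stride %2d -> %2d of %d banks used\n", strides[s], used, NBANKS);
        }
        return 0;
    }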
Graph 500 – Performance/$ and Performance/Watt
(Adapted from http://www.graph500.org/Results.html prior to June 22, 2011)

Rank  Machine                                             Owner                                   Scale  GTEPS  $K       Perf/$ (TEPS/$)  Power (kW)  Perf/W (kTEPS/W)
1     Intrepid (IBM BlueGene/P, 8192 nodes/32k cores)     Argonne National Laboratory             36     6.600  $10,400  635              136         49
2     Franklin (Cray XT4, 500 of 9544 nodes)              NERSC                                   32     5.220  $2,000   2610             80          65
3     cougarxmt (128 node Cray XMT)                       Pacific Northwest National Laboratory   29     1.220  $1,000   1220             100         12
4     graphstorm (128 node Cray XMT)                      Sandia National Laboratories            29     1.170  $1,000   1170             100         12
5     Endeavor (256 node, 512 core Westmere X5670 2.93)   Intel Corporation                       29     0.533  $2,000   267              88.3        6
6     Erdos (64 node Cray XMT)                            Oak Ridge National Laboratory           29     0.051  $500     102              50          1
7     Red Sky (Nehalem X5570, IB Torus, 512 processors)   Sandia National Laboratories            28     0.478  $1,500   319              88.3        5
8     Jaguar (Cray XT5-HE, 512 node subset)               Oak Ridge National Laboratory           27     0.800  $2,700   296              80          10
–     Convey (single node HC-1ex)                         Convey Computer Corporation             27     0.773  $80      9663             0.75        1031
9     Endeavor (128 node, 256 core Westmere X5670)        Intel Corporation                       26     0.616  $1,000   616              88.3        7
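The derived columns above appear to be simple ratios of the reported GTEPS to cost and power (units as labeled in the table); a small check, using two of the rows, that reproduces them:

    #include <stdio.h>

    /* Recompute the derived Graph 500 columns from GTEPS, cost ($K) and power (kW).
     * Perf/$ comes out in TEPS per dollar, Perf/W in kTEPS per watt. */
    struct entry { const char *name; double gteps, cost_k, kw; };

    int main(void) {
        struct entry rows[] = {
            { "Intrepid", 6.600, 10400.0, 136.0 },
            { "Franklin", 5.220,  2000.0,  80.0 },
        };
        for (unsigned i = 0; i < sizeof rows / sizeof rows[0]; ++i) {
            double teps_per_dollar = rows[i].gteps * 1e9 / (rows[i].cost_k * 1e3);
            double kteps_per_watt  = rows[i].gteps * 1e9 / (rows[i].kw * 1e3) / 1e3;
            printf("%-10s %7.0f TEPS/$  %5.0f kTEPS/W\n",
                   rows[i].name, teps_per_dollar, kteps_per_watt);
            /* prints 635 / 49 for Intrepid and 2610 / 65 for Franklin,
             * matching the table's derived columns */
        }
        return 0;
    }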
November 2011 Graph 500 Benchmark (BFS)
(Problem size 31 and lower)

Rank  System                                                   Site                           Scale  MTEPS   MTEPS/kW
20    Dingus (Convey HC-1ex - 1 node / 4 cores, 4 FPGAs)       SNL                            28     1,759   2,345
22    Vortex (Convey HC-1ex - 1 node / 4 cores, 4 FPGAs)       Convey Computer                28     1,675   2,233
23    Convey01 (Convey HC-1ex - 1 node / 4 cores, 4 FPGAs)     Bielefeld University, CeBiTec  28     1,615   2,153
24    HC1-d (Convey HC-1 - 1 node / 4 cores, 4 FPGAs)          Convey Computer                28     1,601   2,135
25    Convey2 (Convey HC-1 - 1 node / 4 cores, 4 FPGAs)        LBL/NERSC                      28     1,598   2,130
7     IBM BlueGene/Q (512 nodes)                               IBM Research, T.J. Watson      30     56,523  1,809
40    ultraviolet (4 processors / 32 cores)                    SNL                            29     420     420
9     SGI Altix ICE 8400EX (256 nodes / 1024 cores)            SGI                            31     14,085  363
26    IBM and ScaleMP/vSMP Foundation (128 cores)              LANL                           29     1,551   242
32    IBM and ScaleMP/vSMP Foundation (128 cores)              LANL                           30     1,036   162
14    DAS-4/VU (128 processors)                                VU University                  30     7,135   139
34    Hyperion (64 nodes / 512 cores)                          LLNL                           31     1,004   39
43    Kraken (6128 cores)                                      LLNL                           31     105     21
35    Matterhorn (64 nodes)                                    CSCS                           31     885     18
19    Todi (176 AMD Interlagos, 176 NVIDIA Tesla X2090)        CSCS                           28     3,060   17
42    Knot (64 cores / 8 processors)                           UCSB                           30     177     9
49    Gordon (7 nodes / 84 cores)                              SDSC                           29     30      3
33    Jaguar (224,256 processors)                              ORNL                           30     1,011   0
Graph – Multi-Threaded
Multi-Threaded Processor
De Bruijn graph
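For context on why a graph workload wants a heavily multi-threaded memory system: building a De Bruijn graph turns sequencing reads into k-mer edges whose traversal is essentially random pointer chasing. A tiny illustrative sketch (the value of k, the read, and the plain edge listing are made-up examples, not the Convey graph personality):

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical sketch: emit the edges of a De Bruijn graph (k = 4) for one
     * read. Each k-mer contributes an edge from its (k-1)-prefix to its
     * (k-1)-suffix. Following such edges has little locality, which is why a
     * heavily multi-threaded memory system helps. */
    #define K 4

    int main(void) {
        const char *read = "GATTACAGATT";
        size_t n = strlen(read);
        for (size_t i = 0; i + K <= n; ++i) {
            char prefix[K], suffix[K];
            memcpy(prefix, read + i, K - 1);     prefix[K - 1] = '\0';
            memcpy(suffix, read + i + 1, K - 1); suffix[K - 1] = '\0';
            printf("%s -> %s\n", prefix, suffix);
        }
        return 0;
    }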
Now What
• A strawman Exascale system, feasible by 2020 (delivered)
• Philosophy/assumptions behind the approach
  – Clock cycle constant
  – Memory, power, and network bandwidth are DEAR
  – Efficiency for delivered ops is more important than peak
  – If we think we know the application profile for Exascale applications, we are kidding ourselves
• Then a counterproposal
  – Contradict myself
Take an Operations Research Method of Prediction
• Moore's Law
• Application specific
  – Matrices
  – Matrix arithmetic
• Hard IP for floating point
• Number of Arithmetic Engines
• System architecture and programming model
  – Language-directed design
• Resulting analysis
  – Benefit
  – Mean and +/- sigma (if normal)
Moore's Law
• Feature Set
  – Every 2 years, twice the logic
  – Thus by 2020: 8 times the logic, same clock rate
  – Mean factor of 7, sigma +/- 2
• Benefit
  – 7 times the performance, same clock rate, same internal architecture
Rock's Law
• Rock's law, named for Arthur Rock, says that the cost of a semiconductor chip fabrication plant doubles every four years. As of 2011, the price had already reached about 8 billion US dollars.
• Rock's Law can be seen as the economic flipside to Moore's Law; the latter is a direct consequence of the ongoing growth of the capital-intensive semiconductor industry: innovative and popular products mean more profits, meaning more capital available to invest in ever higher levels of large-scale integration, which in turn leads to the creation of even more innovative products.
http://newsroom.intel.com/community/en_za/blog/2010/10/19/intel-announces-multi-billion-dollar-investment-innext-generation-manufacturing-in-us
http://en.wikipedia.org/wiki/Rock%27s_law
Application Specific
• Feature Set
  – Matrix Engine
  – Mean factor of 4 (+4/-1)
• Linpack efficiency (% of peak) over the years:
  – June 2011: Riken K – 93% (548,352 cores – vector)
  – November 2010: Tianhe-1A – 53% (86,016 cores & GPU); French Bull – 83% (17,480 sockets, 140k cores)
  – June 2010: Jaguar – 75% (224,162 cores)
  – June 2006: Earth Simulator (NEC) – 87.5% (5120)
  – June 1993: NEC SX3/44 – 95% (4); Cray YMP (16) – 90% (9)
• Benefit
  – 90% of peak
  – Matrix arithmetic
  – Outer loop parallel / inner loop vector within ONE functional unit
3D Finite Difference (3DFD) Personality
• designed for nearest-neighbor operations on structured grids
  – maximizes data reuse
• reconfigurable "registers"
  – 1D (vector), 2D, and 3D modes
  – 8192 elements in a register
• operations on entire cubes
  – "add points to their neighbor to the left times a scalar" is a single instruction
  – up to 7 points away in any direction
• finite difference method for post-stack reverse-time migration

    X(I,J,K) = S0*Y(I  ,J  ,K  )
             + S1*Y(I-1,J  ,K  )
             + S2*Y(I+1,J  ,K  )
             + S3*Y(I  ,J-1,K  )
             + S4*Y(I  ,J+1,K  )
             + S5*Y(I  ,J  ,K-1)
             + S6*Y(I  ,J  ,K+1)
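The X(I,J,K) update above is the usual 7-point stencil. A minimal scalar C sketch of the same loop nest (array sizes, coefficients, and the layout macro are illustrative; the personality executes this as cube-wide vector operations rather than an explicit loop nest):

    /* 7-point stencil corresponding to the X(I,J,K) update above.
     * Plain scalar C for illustration only.
     * NX, NY, NZ and the s[] coefficients are made-up example values. */
    #define NX 64
    #define NY 64
    #define NZ 64
    #define IDX(i, j, k) ((i) + NX * ((j) + NY * (k)))

    static void stencil7(double *x, const double *y, const double s[7]) {
        for (int k = 1; k < NZ - 1; ++k)
            for (int j = 1; j < NY - 1; ++j)
                for (int i = 1; i < NX - 1; ++i)
                    x[IDX(i, j, k)] =
                          s[0] * y[IDX(i,     j,     k    )]
                        + s[1] * y[IDX(i - 1, j,     k    )]
                        + s[2] * y[IDX(i + 1, j,     k    )]
                        + s[3] * y[IDX(i,     j - 1, k    )]
                        + s[4] * y[IDX(i,     j + 1, k    )]
                        + s[5] * y[IDX(i,     j,     k - 1)]
                        + s[6] * y[IDX(i,     j,     k + 1)];
    }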
FPGA - Hard IP Floating Point
• Feature Set
  – By 2020
  – Fused Multiply Add – Mean Factor of 8 +/- 2
  – Reconfigurable IP Multiply and Add – Mean Factor of 4 +/- 1
  – Bigger fixed-point DSP's – Mean Factor of 3
• Benefit
  – More Floating Point ALU's
  – More routing paths
Number of AE FPGA's per node
• Feature Set
  – Extensible AE's
  – 4, 8, or 16 AE's as a function of physical memory capacity
    • 4 – one byte per flop – mean factor of 1
    • 8 – ½ byte per flop – mean factor of 2
    • 16 – ¼ byte per flop – mean factor of 4
• Benefit
  – More internal parallelism
  – Transparent to user
  – Potential heterogeneity within node
Exascale Programming Model
Example 4.1-1: Matrix by Vector Multiply (UPC)
NOTE: THREADS is 16000

    #include <upc_relaxed.h>
    #define N 200*THREADS
    shared [N] double A[N][N];
    shared double b[N], x[N];

    void main()
    {
        int i, j;
        /* read the elements of matrix A and the vector x,
           and initialize the vector b to zeros */
        upc_forall(i = 0; i < N; i++; i)
            for(j = 0; j < N; j++)
                b[i] += A[i][j] * x[j];
    }
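The trailing i in the upc_forall above is the affinity expression: iteration i runs on thread i % THREADS, which is also where row i of A and element b[i] live, so only x[j] may be a remote reference. An equivalent way to express the same placement, shown here as a drop-in replacement for the loop (standard UPC, not from the slide), uses a pointer-to-shared affinity expression:

    /* Same loop as the example, but the affinity expression names the data:
     * run iteration i wherever row i of A resides. Only x[j] may be remote. */
    upc_forall(i = 0; i < N; i++; &A[i][0])
        for(j = 0; j < N; j++)
            b[i] += A[i][j] * x[j];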
Math results in
• Using the mean:
  – 7 – Moore's Law
  – 4 – matrix arithmetic
  – 0.90 – efficiency (percentage of peak)
  – 8 – fused multiply/add (64 bits)
  – 4 – 16 AE's (user-visible pipelines) per node
  – or a MEAN of 800 times today
• Best case – 2304
• Worst case – 448
The System
• 2010 base-level node is 80 GFlops/node (peak)
• Thus 7 x 4 x .9 x 8 x 4 = 806 factor
  – Mean of 800 = +1500 (upside) / -400 (downside)
  – 64 TFlops/node (peak)
  – 16,000 nodes / Exascale Linpack
  – High-speed serial DRAM
• 64-bit virtual address space
  – Flat address space
  – UPC/OpenMP addressing paradigm integrated within TLB hardware
  – Programming model is 16,000 shared-memory nodes
• Compiler optimizations (user transparent) deal with local node micro-architecture
• Power is 1.5 KWatts/node (3-4U rack mounted)
  – 24 MegaWatts/system
  – 32 TBytes/node (288 PetaBytes – system)
  – Physical memory is approx 60% of power
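A quick arithmetic check of the strawman, multiplying out the numbers the slides give (80 GFlops/node in 2010, the 7 x 4 x 0.9 x 8 x 4 factor, 16,000 nodes, 1.5 kW/node); the program below just reproduces them:

    #include <stdio.h>

    /* Multiply out the strawman factors from the slides:
     * Moore (7) x matrix engine (4) x efficiency (0.9) x FMA (8) x AEs (4). */
    int main(void) {
        double factor = 7.0 * 4.0 * 0.9 * 8.0 * 4.0;     /* ~806x over 2010 */
        double node_gflops_2010 = 80.0;                  /* 2010 baseline   */
        double node_tflops_2020 = node_gflops_2010 * factor / 1000.0;
        int    nodes = 16000;
        double system_pflops = node_tflops_2020 * nodes / 1000.0;
        double system_mw = nodes * 1.5e3 / 1e6;          /* 1.5 kW per node */

        printf("speedup factor : %.0f\n", factor);                 /* 806     */
        printf("per-node peak  : %.1f TFlops\n", node_tflops_2020); /* ~64.5  */
        printf("system peak    : %.0f PFlops (~1 EFlop)\n", system_pflops);
        printf("system power   : %.0f MW\n", system_mw);           /* 24 MW  */
        return 0;
    }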
INTEGRATED SMP - WDM
[Figure: node diagram – 16/32 TeraBytes of highly interleaved DRAM behind a 6.4 TBytes/sec crossbar (.1 bytes/sec per peak flop), with multi-lambda optical transmit/receive links.]
IBM Optical Technology
http://domino.research.ibm.com/comm/research_projects.nsf/pages/photonics.index.html
COTS ExaOPs/Flop/s System
[Figure: 16,000 single nodes (16 AE's each) connected through an all-optical switch (10 meters = 50 ns delay), with I/O to the LAN/WAN.]
Concluding (almost)
• Uniprocessor performance has to be increased
  – Heterogeneous is here to stay
  – The easiest to program will be the correct technology
• Smarter memory systems (PIM)
• New HPC software must be developed
  – SMARTER COMPILERS
  – ARCHITECTURE TRANSPARENT
• New algorithms, not necessarily new languages
CONVEY COMPUTER
• DESIGN SPACE – IPHONE 6.0S
  – User Interface
  – Processor
  – External Communications

2009, Sept, "CONVEY COMPUTER", Annapolis, MD
http://www.lps.umd.edu/AdvancedComputing/
Processor - 2020
• Power budget
  – 300 milliwatts
• Either ARM or IA-64 compatible
  – In 2020 all other ISA's either will be gone or niche players (no volume)
• 64-bit virtual address space (maybe)
• 2 to 4 GBytes physical memory
• FPGA co-processor (hybrid processor on a die)
• 1 Terabyte Flash
• 10 to 20 GFlops (real*8)
  – 3D video drives DSP performance
• 2**24 chips
  – Approx ExaOP
  – Approx 64 PetaBytes – DRAM
  – Approx 2.4 MegaWatts
  – Switch is cell towers
Virtual Address Space
• 64 bits is NOT enough
  – Exa is 2**60
• Applications reference the network, not just local memory (disk)
• Time to embed/contemplate the network in the address space
  – Unique name
    • IPv6 (unicast host.id)
    • Global phone number
    • MAC address
• 128-bit virtual address space
  – NO NATs
  – JUST LIKE the world-wide phone numbering system
  – JUST LIKE IPv6
  – US04656579 04/07/1987: "Digital data processing system having a uniquely organized memory system and means for storing and accessing information therein"
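One way to picture a 128-bit global address of the kind argued for above: a network-unique node identifier in the upper 64 bits (an IPv6 host id, phone number, or MAC-derived name) and an ordinary 64-bit virtual address below it. The field split is an illustrative assumption, not a definition from the talk:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative only: one possible 128-bit global address layout in the
     * spirit of the slide (network name in the upper half, local 64-bit VA in
     * the lower). The split is an assumption, not the talk's definition. */
    typedef struct {
        uint64_t node;    /* network-unique id, e.g. derived from an IPv6 host id */
        uint64_t offset;  /* ordinary 64-bit virtual address within that node     */
    } gva128_t;

    static gva128_t make_gva(uint64_t node, uint64_t offset) {
        gva128_t a = { node, offset };
        return a;
    }

    int main(void) {
        gva128_t a = make_gva(0x20010db800000001ULL, 0x00007f00deadbeefULL);
        printf("node %016llx : offset %016llx\n",
               (unsigned long long)a.node, (unsigned long long)a.offset);
        return 0;
    }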
Finally
The FLOPs/Watt and OPs/Watt system which is the simplest to program will win.
USER cycles are more important than CPU cycles.