Coarse Grain Reconfigurable Architectures

International Supercomputer Conference
Dresden, Germany, June 28 - 30, 2006
Reconfigurable Supercomputing: Hurdles and Chances
Reiner Hartenstein
TU Kaiserslautern
http://hartenstein.de
>> Outline <<
• Preface
• The von Neumann paradigm trap
• Supercomputing: the wrong Road Map
• The Solution ignored for decades
• Fine-grained vs. coarse-grained
• The wrong Road Map for CS Curricula
• Conclusions
http://www.uni-kl.de
© 2006, [email protected]
Preface
My talk does not really cover the performance of bulk storage, discs, etc.
My talk highlights the supercomputing paradigm trap and the fully ignored early solution.
The talk illustrates why there is a hidden paradigm shift behind the success of FPGAs.
The Pervasiveness of Reconfigurable Computing (RC)
FPGAs are used everywhere
[Table: number of Google hits for searches of the form "FPGA and ….", Nov. 2005 (search terms not shown). Hit counts: 647,000; 1,490,000; 171,000; 194,000; 398,000; 1,620,000; 127,000; 113,000; 158,000; 162,000; 915,000; 272,000]
An Example: FPGAs in Oil and Gas ....
[Herb Riley, R. Associates]
"Application migration [from supercomputer] has resulted in a 17-to-1 increase in performance"
Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack
did you know …
… 25% of Amsterdam's electric energy consumption goes into server farms ?
… a quarter square-kilometer of office floor space
within New York City is occupied by server farms ?
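To put the quoted electricity saving into perspective, here is a back-of-the-envelope sketch. Only the $10,000 per year and the 7¢ / kWh come from the slide; the around-the-clock operation and all names in the code are assumptions made for illustration.

```python
# Back-of-the-envelope check of the quoted electricity saving.
PRICE_PER_KWH_USD = 0.07       # 7 cents per kWh, as quoted on the slide
SAVING_PER_YEAR_USD = 10_000   # lower bound quoted on the slide
HOURS_PER_YEAR = 24 * 365      # ASSUMPTION: the rack runs around the clock


def saved_energy_kwh(saving_usd: float, price_per_kwh: float) -> float:
    """Energy (kWh) that corresponds to a given saving on the bill."""
    return saving_usd / price_per_kwh


def average_power_kw(energy_kwh: float, hours: float = HOURS_PER_YEAR) -> float:
    """Average power (kW) if the energy is spread evenly over the hours."""
    return energy_kwh / hours


energy = saved_energy_kwh(SAVING_PER_YEAR_USD, PRICE_PER_KWH_USD)
power = average_power_kw(energy)
print(f"{energy:,.0f} kWh per year, about {power:.1f} kW on average")
```

Under these assumptions the saving corresponds to roughly 143,000 kWh per year, i.e. about 16 kW of continuous power per rack.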
Oil and Gas as a strategic issue
Do you know the amount of Google's electricity bill?
It should be investigated how far the migration achievements obtained for computationally intensive applications can also be utilized for servers.
15 GigaFLOPs on a single FPGA chip
Last night I met Stamatis Vassiliadis (TU Delft)
15 GigaFLOPs on a single chip for matrix computations
A surprise: much less memory needed than expected
some published speed-up factors
The RC paradox: these speed-ups have been published although the effective integration density of FPGAs is 4 orders of magnitude behind the Moore curve (wiring overhead, reconfigurability overhead, routing congestion)
[Figure: relative performance (log scale, 10^0 to 10^9) over time (1980 to 2010), with the 8080 and the Pentium 4 as CPU reference points]
Application areas shown: DSP and wireless; image processing, pattern matching, multimedia; crypto; bioinformatics; astrophysics
Applications shown: Reed-Solomon decoding, Viterbi decoding, FFT, MAC, real-time face detection, video-rate stereo vision, pattern recognition, SPIHT wavelet-based image compression, protein identification, BLAST, Smith-Waterman pattern matching, molecular dynamics simulation, GRAPE
Speed-up factors shown: 6000, 2400, 1000, 1000, 900, 730, 457, 400, 288, 100, 88, 52, 40, 20
Educational Deficits
Educational deficits have stalled Reconfigurable Computing (RC) as well as classical supercomputing
Transdisciplinary fragmentation: each application domain uses its own bag of tricks
Too many sophisticated, very clever architectures
We need a fundamental model with a methodology which all application domains have in common
Transdisciplinary education & basic research needed
>> Outline <<
• The von Neumann paradigm trap
The basic model paradigm trap
High performance computing stalled for decades by the von Neumann paradigm trap
most systems are extremely unbalanced
For decades the right roadmap was hidden by another paradigm trap
stolen from Bob Colwell
Transdisciplinary Education?
Computer Science not prepared
Lacking intradisciplinary cohesion between the mind sets of:
• Theoreticians (Math background)
• Hardware People
• Computer Architects
• Embedded Syst. Designers
• Software People (Application Development)
for decades: the Hardware / Software chasm turns into the Configware / Software chasm
migration of the lemmings
Flagship conference series: IEEE ISCA
[David Padua, John Hennessy, et al.]
Jean-Loup Baer
The Dead Supercomputer Society
Research 1985 – 1995 [Gordon Bell, keynote ISCA 2000]
• ACRI
• Alliant
• American Supercomputer
• Ametek
• Applied Dynamics
• Astronautics
• BBN
• CDC
• Convex
• Cray Computer
• Cray Research
• Culler-Harris
• Culler Scientific
• Cydrome
• Dana/Ardent/Stellar/Stardent
• DAPP
• Denelcor
• Elexsi
• ETA Systems
• Evans and Sutherland Computer
• Floating Point Systems
• Galaxy YH-1
• Goodyear Aerospace MPP
• Gould NPL
• Guiltech
• ICL
• Intel Scientific Computers
• International Parallel Machines
• Kendall Square Research
• Key Computer Laboratories
• MasPar
• Meiko
• Multiflow
• Myrias
• Numerix
• Prisma
• Tera
• Thinking Machines
• Saxpy
• Scientific Computer Systems (SCS)
• Soviet Supercomputers
• Supertek
• Supercomputer Systems
• Suprenum
• Vitesse Electronics
>> Outline <<
• Supercomputing: the wrong Road Map
Moving Data around
Crossbar weight: 220 t, 3000 km of thick cable,
5120 Processors, 5000 pins each
ES 20: TFLOPS
peak or sustained?
The Memory Wall (1)
Moving data to the processor:
Data meeting the Processing Unit (PU)
We have 2 choices:
• by Software: routing the data by memory-cycle-hungry instruction streams
• by Configware: placement of the execution locality (optimize a pipe network: place PU in data stream)
Illustrating the von Neumann paradigm trap: the watering pot model [Hartenstein]
The instruction-stream-based approach
many watering pots
The data-stream-based approach
von Neumann bottleneck
The Memory Wall (2)
Key problem is the inefficiency and complexity of moving data, not processor performance.
Most important goal is the minimization of the number of main memory cycles.
Supercomputing urgently needs a fundamentally different approach toward interconnect efficiency.
Tear down this Wall !
>> Outline <<
• The Solution ignored for decades
The Systolic Array (H. T. Kung paradigm)
CS Mathematicians' hobby, early 1980s
nice time/space notation - defines: which data item, at which time, at which port
DPA*: a pipe network
DataPath Unit has no program counter! it's no CPU!
*) DataPath Array (array of DPUs)
[Figure: input data streams enter the DPA and output data streams leave it; each stream is drawn as a time / port # diagram]
Terminology
term     program counter   execution triggered by   paradigm
CPU      yes               instruction fetch        instruction-stream-based
DPU**    no                data arrival*            data-stream-based
**) does not have a program counter
*) “transport-triggered”
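The CPU / DPU distinction above can be illustrated with a minimal sketch of a transport-triggered pipe network. The class `DPU`, the function `pipe_network` and the three example operators are hypothetical names chosen for illustration, not part of the talk. The software loop only imitates the behaviour sequentially; in hardware all DPUs work concurrently on different data items.

```python
from typing import Callable, Iterable, Iterator, List


class DPU:
    """Hypothetical DataPath Unit: no program counter, no instruction fetch.

    The unit holds one configured operator and fires whenever a data item
    arrives (transport-triggered).
    """

    def __init__(self, op: Callable[[int], int]) -> None:
        self.op = op
        self.fired = 0  # how often the unit was triggered by data arrival

    def on_arrival(self, item: int) -> int:
        self.fired += 1
        return self.op(item)


def pipe_network(dpus: List[DPU], stream: Iterable[int]) -> Iterator[int]:
    """Push every item of the input data stream through the chain of DPUs."""
    for item in stream:
        for dpu in dpus:
            item = dpu.on_arrival(item)
        yield item


# Configuration: three operators placed one after another in the data stream.
stages = [DPU(lambda x: x + 1), DPU(lambda x: x * 2), DPU(lambda x: x - 3)]
results = list(pipe_network(stages, [1, 2, 3]))
print(results)  # [1, 3, 5]
```

Note that the only "program" here is the configuration of `stages`; the sequencing comes from the data stream itself.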
The new paradigm: how the data are traveling
better not by instruction execution
An old hat: transport-triggered + instruction-driven
[Jack Lipovski, EUROMICRO, Nice, 1975]
[Figure: DPUs connected as a pipeline, or by chaining]
vN Move Processor: instruction-driven
super systolic array
P&R: move locality of operation, not data !
The right road map to HPC: ignored for decades
massively reducing memory cycles
DPU operation is transport-triggered
no instruction streams
no message passing
nor thru common memory
where were the supercomputing people ?
[Figure: DPA with input data streams and output data streams]
Mathematicians X-ing
Systolic Synthesis
Mathematicians like the beauty and elegance of Systolic Arrays.
Due to a lack of intradisciplinary view, their efforts yielded poor synthesis algorithms.
Reiner Hartenstein
Synthesis Method?
of course, algebraic !
Algebraic means linear projection, restricted to uniform arrays with linear pipes only
useful only for applications with strictly regular data dependencies
Mathematicians caught by their own paradigm trap for more than a decade
rDPA: Generalization* by a transdisciplinary hardware guy:
Rainer Kress discarded their algebraic synthesis methods and replaced them by simulated annealing (1995)
*) super-systolic
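To illustrate what "synthesis by simulated annealing" means, here is a minimal placement sketch. It is not Rainer Kress's mapper: the cost function (total Manhattan wire length), the cooling schedule, the function names and the tiny example netlist are all assumptions chosen for illustration.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def wire_length(place: Dict[str, Cell], nets: List[Tuple[str, str]]) -> int:
    """Total Manhattan distance over all two-point nets."""
    return sum(abs(place[a][0] - place[b][0]) + abs(place[a][1] - place[b][1])
               for a, b in nets)


def anneal_placement(ops: List[str], nets: List[Tuple[str, str]],
                     rows: int, cols: int, steps: int = 5000,
                     t0: float = 5.0, alpha: float = 0.999,
                     seed: int = 0) -> Tuple[Dict[str, Cell], int]:
    """Place operators on a rows x cols array, minimizing wire length."""
    if len(ops) > rows * cols:
        raise ValueError("more operators than array cells")
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    start = cells[:]
    rng.shuffle(start)                      # random initial placement
    grid: Dict[Cell, Optional[str]] = {cell: None for cell in cells}
    place: Dict[str, Cell] = {}
    for op, cell in zip(ops, start):
        grid[cell] = op
        place[op] = cell
    cost = wire_length(place, nets)
    best, best_cost = dict(place), cost
    temperature = t0
    for _ in range(steps):
        a, b = rng.sample(cells, 2)         # pick two cells to swap
        op_a, op_b = grid[a], grid[b]
        if op_a is not None or op_b is not None:
            grid[a], grid[b] = op_b, op_a   # tentative swap
            if op_a is not None:
                place[op_a] = b
            if op_b is not None:
                place[op_b] = a
            delta = wire_length(place, nets) - cost
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                cost += delta               # accept (sometimes even uphill)
                if cost < best_cost:
                    best, best_cost = dict(place), cost
            else:
                grid[a], grid[b] = op_a, op_b   # reject: undo the swap
                if op_a is not None:
                    place[op_a] = a
                if op_b is not None:
                    place[op_b] = b
        temperature *= alpha                # cool down
    return best, best_cost


ops = ["mul", "add", "sub", "out"]
nets = [("mul", "add"), ("add", "sub"), ("sub", "out")]
best, best_cost = anneal_placement(ops, nets, rows=3, cols=3)
print(best, best_cost)
```

Unlike linear projection, nothing in this procedure requires regular data dependencies or a uniform array: any netlist and any cost function can be plugged in.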
Generating the Data Streams
Who generates the data streams ?
Mathematicians: it's not our job (it's not algebraic)
[Figure: DPA with input data streams and output data streams]
Data stream generators
use data counters, no program counter
[Figure: reconfigurable (pipe network) rDPA surrounded by ASM blocks for the input and output data streams; 32 ports, or n x 32 ports; other example]
implemented by distributed on-chip memory
50 & more on-chip ASM are feasible
non-von-Neumann machine paradigm
[Figure: an ASM, with the labels GAG, RAM and data counter]
ASM: AutoSequencing
Memory
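How a data counter can replace a program counter is sketched below. The slide only names the parts (GAG, RAM, data counter); the 2-D scan pattern, the function `address_generator` (standing in for the GAG) and the class `AutoSequencingMemory` are hypothetical and serve illustration only.

```python
from typing import Iterator, List


def address_generator(base: int, rows: int, cols: int,
                      row_stride: int, col_stride: int = 1) -> Iterator[int]:
    """Stand-in for the GAG: nested data counters produce a 2-D scan.

    No instructions are fetched to compute these addresses.
    """
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c * col_stride


class AutoSequencingMemory:
    """Hypothetical ASM: a RAM block that sequences itself via its generator."""

    def __init__(self, ram: List[int], addresses: Iterator[int]) -> None:
        self.ram = ram
        self.addresses = addresses

    def stream(self) -> Iterator[int]:
        """Emit the data stream, one item per generated address."""
        for addr in self.addresses:
            yield self.ram[addr]


ram = list(range(100, 116))  # a 4 x 4 block stored row by row
asm = AutoSequencingMemory(
    ram, address_generator(base=5, rows=2, cols=2, row_stride=4))
data_stream = list(asm.stream())
print(data_stream)  # [105, 106, 109, 110]
```

The scan parameters play the role of the configuration: changing `base`, `rows`, `cols` or the strides yields a different data stream without any instruction stream.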
Compilation: Software vs. Configware
Software Engineering: source program -> software compiler -> software code (instruction streams)
Configware Engineering: source "program" (C, FORTRAN, MATLAB, …) -> configware compiler (mapper: placement & routing; data scheduler) -> configware code (configuration) and flowware code (data streams)
>> Outline <<
• Fine-grained vs. coarse-grained
Coarse-grained vs. fine-grained
Definition of FPGA: see previous talk by Dr. Thomas Steinke
device          granularity                    path width     eff've density   flexibility
FPGA            fine-grained                   ~ 1 bit        very low         general purpose
DPA             coarse-grained                 multi bit      very high        specialized
rDPA            coarse-grained                 e.g. 32 bits   high             domain-specific
platform FPGA   fine-grained & embedded hdw.   mixed
FPGA with island architecture
[Figure: reconfigurable interconnect fabrics with connect boxes and switch boxes around reconfigurable logic boxes]
[Figure: example of a wire routed for 1 net, between A and B]
Why coarse grain
instead of FPGA use rDPA
instead of rLB (~1 bit wide) use rDPU (e. g. 32 bits wide)
rDPU: reconfigurable Data Path Unit (e. g. rALU)
much more area-efficient
much less reconfigurability overhead
much more MOPS/milliWatt
mind set close to classical computing background
[Figure: array of 16 rDPUs]
Coarse grain is about computing, not logic
Example: mapping onto rDPA by DPSS: based on simulated annealing
SNN filter on KressArray (mainly a pipe network)
[Ulrich Nageldinger]
array size: 10 x 16 = 160 rDPUs
rDPU: reconfigurable function block, e. g. 32 bits wide
no CPU
Legend: rDPU not used; used for routing only; operator and routing; port location marker; backbus connect
[Figure annotations: route thru only; not used; backbus connect]
(r)DPA
commercial rDPA example: PACT XPP - XPU128
• Full 32 or 24 Bit Design
• 2 Configuration Hierarchies
• Evaluation Board available, and
• XDS Development Tool with Simulator
working silicon
[Figure: XPP128 rDPA with the labels rDPU, PAE core, ALU, Ctrl, CFG; buses not shown]
© PACT AG, http://pactcorp.com
e. g.: array w. 56 rDPUs: running under 500 MHz
World TV & game console & multi media center
• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise Reduction and Artifact Removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable Displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness Enhancement
• Shadow Enhancement
• Differentiation
• Programmable de-interlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes
[Figure: SMeXPP rDPA with the labels Games, Camera, SD/MMC Cards, Videos, Music, Radio-Interface, Audio-Interface, Baseband Processor, LCD DISPLAY]
http://pactcorp.com
DSP platform FPGA [courtesy Xilinx Corp.]
500MHz PowerPC™ Processors (680DMIPS) with Auxiliary Processor Unit
500MHz Programmable DSP Execution Units
500MHz multi-port Distributed 10 Mb SRAM
500MHz Flexible Soft Logic Architecture, 200K Logic Cells
500MHz DCM Digital Clock Management
0.6-11.1Gbps Serial Transceivers
1Gbps Differential I/O
>> Outline <<
• The wrong Road Map for CS Curricula
Joint Task Force for Computing Curricula 2004 fully ignores Reconfigurable Computing
Curricula ?
FPGA & synonyms: 0 hits (Google: 10 million hits)
not even here
Curriculum Recommendations, v. 2005
Upon my complaints* the only change: including at the end of the last paragraph of the survey volume:
"programmable hardware (including FPGAs, PGAs, PALs, GALs, etc.)."
However, no structural changes at all
v. 2005 intended to be the final version (?)
torpedoing the transdisciplinary responsibility of CS curricula
This is criminal !
Peter Denning …
*) no reply
Here is the common model
it's not von Neumann
[Figure labels: software code, instruction-stream-based CPU; configware code, data-stream-based reconfigurable accelerator; hardwired accelerator]
most accumulated MIPS have been migrated here
mainly just for running legacy code etc.
the tail is wagging the dog
Dual Paradigm Application Development
Juergen Becker's CoDe-X, 1996
high level language: C language source
software/configware co-compiler: Partitioner, SW compiler, CW compiler
SW compiler -> software code -> instruction-stream-based CPU
CW compiler -> configware code -> data-stream-based reconfigurable accelerator (array of rDPUs)
hardwired accelerator
For Transdisciplinary CS Education
The von-Neumann-only mind set is obsolete
[Figure labels: procedural-only; structural (data-stream-based) and procedural (instruction-stream-based)]
We need a curricular dual-paradigm approach
>> Outline <<
• Conclusions
Taxonomy of Algorithm Migration (1)
(Instruction-stream-based algorithm taxonomy: partially existing, not really systematic)
Algorithms migrated to time-space domain (for RC): a taxonomy does not exist
Computationally intensive applications are the best candidates for migration to FPGA
A few algorithms (e. g. Turbocode or Viterbi) require a massive amount of interconnect
bulk data bases might be candidates for FPGA usage to avoid memory cycles for address computation
Steadily coming and going data streams are best candidates
Taxonomy of Algorithm Migration (2)
Migration efficiency (reducing memory cycles):
Servers: to be investigated. What is certain:
• loop transformations: efficient, deterministic
• caches: non-deterministic and energy guzzlers
• much less local memory needed
• secondary data memory: distributed on-chip memory architectures highly promising
• address computations: efficient migration
Conclusions
excellent results proven for computationally intensive applications
highly promising for servers
improvements likely for bulk data & storage applications
tool and language scenario needs an urgent transdisciplinary clean-up
thank you
END
Backup:
SW / CW Co-Compilation
C, FORTRAN, MATLAB, etc.
Software / Configware Co-Compiler: automatic SW / CW partitioner
software compiler -> software code
configware compiler (mapper, data scheduler) -> configware code, flowware code
Why 2 different Codes needed ?
Nick Tredennick's Paradigm Shifts
Software Engineering (CPU): resources: fixed; algorithm: variable; 1 programming source needed (software)
Configware Engineering: resources: variable; algorithm: variable; 2 programming sources needed (configware, flowware)
configware solution: computing in space
for demo: a tiny section of the pipe network
inter-rDPU-communication: no memory cycles needed
[Figure: array of rDPUs, with the labels + and S]
Compare it to software solution on CPU
S = R + (if C then A else B endif);
[Figure: pipe network section with inputs R, B, A, condition C = 1, adder + and output S; clock: 200]
on a very simple CPU (columns: memory cycles, nanoseconds):
if C then read A: read instruction; instruction decoding; read operand*; operate & register transfers
if not C then read B: read instruction; instruction decoding
add & store: read instruction; instruction decoding; operate & register transfers; store result
total
section of a major pipe network on rDPU
hypothetical branching example to illustrate software-to-configware migration
S = R + (if C then A else B endif);
[Figure: pipe network section with inputs R, B, A, condition C = 1, adder + and output S; clock: 200 MHz (5 nanosec)]
simple conservative CPU example (C = 1):
step                                                memory cycles   nanoseconds
if C then read A:       read instruction            1               100
                        instruction decoding
                        read operand*               1               100
                        operate & reg. transfers
if not C then read B:   read instruction            1               100
                        instruction decoding
add & store:            read instruction            1               100
                        instruction decoding
                        operate & reg. transfers
                        store result                1               100
total                                               5               500
*) if no intermediate storage in register file
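The arithmetic behind the CPU totals can be retraced with a small sketch. The step list and the 100 ns memory cycle are taken from the slide; the comparison with the 5 ns clock rests on the assumption, not stated on the slide, that a filled pipe network delivers one result per clock. All names are chosen for illustration.

```python
MEMORY_CYCLE_NS = 100  # one memory cycle, as listed on the slide
PIPE_CLOCK_NS = 5      # 200 MHz clock quoted for the pipe network

# (step, memory cycles) for the simple conservative CPU example
CPU_STEPS = [
    ("read instruction", 1),
    ("instruction decoding", 0),
    ("read operand", 1),
    ("operate & reg. transfers", 0),
    ("read instruction", 1),
    ("instruction decoding", 0),
    ("read instruction", 1),
    ("instruction decoding", 0),
    ("operate & reg. transfers", 0),
    ("store result", 1),
]


def cpu_totals(steps, cycle_ns=MEMORY_CYCLE_NS):
    """Return (memory cycles, nanoseconds) spent on memory accesses."""
    cycles = sum(n for _, n in steps)
    return cycles, cycles * cycle_ns


cycles, cpu_ns = cpu_totals(CPU_STEPS)
# ASSUMPTION (not from the slide): one result per clock once the pipe is filled.
ratio = cpu_ns / PIPE_CLOCK_NS
print(cycles, cpu_ns, ratio)  # 5 500 100.0
```

Only the steps that touch memory contribute to the total; decoding and register transfers are not counted in this conservative example.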
The wrong mind set ....
S = R + (if C then A else B endif);
section of a very large pipe network:
[Figure: inputs R, B, A, condition C = 1, adder +]
"but you can't implement decisions!"
not knowing this solution:
symptom of the hardware / software chasm
and the
configware / software chasm
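The decision in S = R + (if C then A else B endif) needs no branch instruction: both operands are available and a selector picks one. The following sketch mimics that in software; `mux`, `add` and `pipe_section` are hypothetical names, and the sample streams are made up for illustration.

```python
from typing import Iterable, Iterator


def mux(c: int, a: int, b: int) -> int:
    """Selector unit: passes A if C is set, otherwise B. No jump involved."""
    return a if c else b


def add(x: int, y: int) -> int:
    """Adder unit."""
    return x + y


def pipe_section(r_stream: Iterable[int], a_stream: Iterable[int],
                 b_stream: Iterable[int],
                 c_stream: Iterable[int]) -> Iterator[int]:
    """S = R + (if C then A else B endif), applied to whole data streams."""
    for r, a, b, c in zip(r_stream, a_stream, b_stream, c_stream):
        yield add(r, mux(c, a, b))


s_stream = list(pipe_section(r_stream=[10, 20, 30],
                             a_stream=[1, 2, 3],
                             b_stream=[7, 8, 9],
                             c_stream=[1, 0, 1]))
print(s_stream)  # [11, 28, 33]
```

The condition C is just another data stream feeding the selector, which is why the decision costs no extra memory cycles.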
(anti-von-Neumann machine paradigm)
ASM: AutoSequencing Memory
[Figure: an ASM, with the labels GAG, RAM and data counter]
Generalization of the DMA
Data Counter instead of Program Counter
GAG & enabling technology: published 1989 [by TU-KL], patented by TI** 1995
Survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik]
Storage Scheme optimization methodology, etc.
*) IMEC & TU-KL
**) -
1986: Xputer Lab at Kaiserslautern: MoM I and II