Coarse Grain Reconfigurable Architectures

Reconfigurable HPC
May 14, 2004 , TU Tallinn, Estonia
Reiner Hartenstein
TU Kaiserslautern
part 2
Data-Stream-based
Computing
Preface
TU Kaiserslautern
Response to the Computenik Shock:
the completely wrong roadmap to HPC
... continue to bang their heads
against the memory wall
blinders
to ignore
the impact
of morphware
© 2004, [email protected]
instead of
http://hartenstein.de
Crusty Computing Sciences
TU Kaiserslautern
memory
wall
went morphware
dead
went morphware
exhausted
98.5% vN-only
this monopoly
is the problem
[David Padua,
John Hennessy]
Exhausted: Dead Supercomputer Society
[Gordon Bell, keynote at ISCA 2000].
•ACRI
•Alliant
•American Supercomputer
•Ametek
•Applied Dynamics
•Astronautics
•BBN
•CDC
•Convex
•Cray Computer
•Cray Research
•Culler-Harris
•Culler Scientific
•Cydrome
•Dana/Ardent/Stellar/Stardent
•DAPP
•Denelcor
•Elexsi
•ETA Systems
•Evans and Sutherland Computer
•Floating Point Systems
•Galaxy YH-1
•Goodyear Aerospace MPP
•Gould NPL
•Guiltech
•ICL
•Intel Scientific Computers
•International Parallel Machines
•Kendall Square Research
•Key Computer Laboratories
•MasPar
•Meiko
•Multiflow
•Myrias
•Numerix
•Prisma
•Tera
•Thinking Machines
•Saxpy
•Scientific Computer Systems (SCS)
•Soviet Supercomputers
•Supertek
•Supercomputer Systems
•Suprenum
•Vitesse Electronics
HPC going configware
International Conference on Field-Programmable Logic and Applications (FPL)
http://fpl.org
Aug. 30 – Sept. 1, 2004, Antwerp, Belgium
... going into every type of application
µProc. accel.
288 submissions! They all work on high performance.
TU Kaiserslautern
>> Machine vs. Anti Machine <<
• Machine vs. Anti Machine
• Terminology
• Morphware Platforms
• coarse-grained Platforms
• Dual Machine Paradigms
• Final Remarks
http://www.uni-kl.de
Reconfigurable Computing:
a second programming domain
Migration of programming to the structural domain
The structural domain has become RAM-based
The opportunity to introduce
the structural domain to programmers ...
... to bridge the gap by clever abstraction mechanisms
using a simple new machine paradigm
machine vs. anti machine (matter vs. anti matter)
[Figure: the instruction stream machine (von Neumann etc.) is a CPU with program counter attached to RAM memory; the data stream machine (anti machine) is a DPU or (r)DPA without sequencer, surrounded by asM memory banks, each with a data counter]
asM: auto-sequencing Memory
asMA: auto-sequencing Memory Array
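The contrast between a program counter and a data counter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the instruction format, the function names (`instruction_stream_sum`, `asm_stream`, `dpu_sum`) and the base/stride address pattern are hypothetical and do not come from the slides.

```python
from typing import Iterator, List


def instruction_stream_sum(memory: List[int]) -> int:
    """Instruction stream machine: a program counter steps through
    instructions, and every instruction carries its own operand address."""
    # Hypothetical instruction format: (opcode, operand address).
    program = [("load_add", addr) for addr in range(len(memory))]
    pc = 0   # program counter
    acc = 0
    while pc < len(program):
        _opcode, addr = program[pc]   # instruction fetch at run time
        acc += memory[addr]           # data fetch and operation
        pc += 1
    return acc


def asm_stream(memory: List[int], base: int, stride: int, count: int) -> Iterator[int]:
    """Hypothetical auto-sequencing memory (asM): a data counter, set up
    once, generates the address sequence and emits a data stream."""
    data_counter = base
    for _ in range(count):
        yield memory[data_counter]
        data_counter += stride


def dpu_sum(stream: Iterator[int]) -> int:
    """DPU without sequencer: it only consumes whatever the stream delivers."""
    acc = 0
    for item in stream:
        acc += item
    return acc
```

Both versions compute the same sum; in the second one the sequencing has moved from the processing unit into the memory side.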
Data-stream-based Computing
The Anti Universe of Computing
The anti universe
TU Kaiserslautern
• Paul Dirac predicted a complete
anti universe consisting of antimatter
• “There are regions in the universe,
which consist of antimatter .....
• .... But there are asymmetries”
• when a particle hits its antiparticle, both
are converted into energy: Annihilation
• We are not aware that there is a new area in computing sciences which consists of the antimatter of computing
• Reconfigurable Computing is made from this antimatter: data-stream-based computing
anti particles
• 1928: Paul Dirac: „there should be an anti electron having positive charge“ (Nobel Prize 1933)
• 1932: Carl David Anderson detected this „positron“ in cosmic radiation (Nobel Prize 1936)
• 1954: new accelerators: synchrotrons like Berkeley‘s Bevatron
• 1955: Owen Chamberlain et al. create anti proton on Bevatron
• 1956: anti neutron created on Bevatron
• 1965: creation of a deuterium anti nucleus at CERN
• 1995: hydrogen anti atom created at CERN by forcing positron and anti proton to merge at very low energy
[Figure: hydrogen and anti hydrogen]
Matter & Antimatter: Atom and Anti Atom
The World of Matter machine paradigm: the Atom
Anti Matter machine paradigm: Anti Atom
Matter & Antimatter of Informatics: Machine and Anti Machine
Machine paradigm: „von Neumann“ (CPU)
Anti Machine paradigm (DPU)
1936: 1st electronic computer (Konrad Zuse)
1946: v. N. machine paradigm
1971: 1st microprocessor (Ted Hoff)
1979: „data streams“ (systolic array: Kung / Leiserson)
1990: anti machine paradigm published
1995: rDPA / DPSS (supersystolic: Rainer Kress)
novel compilation techniques
Matter vs. antimatter: CPU vs. DPU
[Figure: the CPU, with program counter and instruction stream, drawn as an atom; the (r)DPU or (r)DPA, without sequencer and fed by data streams, drawn as its anti atom]
heavy anti atoms: DPA = DPU array
coherent data streams spinning around
[Figure: DPAs drawn as heavy anti atoms built from several DPUs]
Terminology: DPU versus CPU ...
• DPU: data path unit
• DPA: DPU array
• GA: gate array
• rDPU: reconfigurable DPU
• rDPA: reconfigurable DPA
• rGA: reconfigurable GA
• DPU is no CPU: there is nothing central - like in a DPA
[Figure: a CPU is a DPU plus instruction sequencer; an (r)DPA is built from (r)DPUs]
Parallelism by Concurrency
independent instruction streams: difficult ...
Annihilation?
avoidable by careful methodology
CS education .....
procedural (software person) vs. structural (hardware person)
Hardware / Software Co-Design?
Configware / Software Co-Design?
Annihilation avoidable by CS curricular revision
>> Terminology <<
„Re-configurable Hardware“ ??
Terminology has been highly confusing
this „Hardware“ is not hard!
it‘s Morphware
We need a concise terminology:
a consensus is on the way
Configware and Flowware ...
... are the sources for programming morphware.
Software is the source for programming traditional hardwired processors (instruction-stream-driven: von Neumann machine paradigm and its derivatives).
For Configware and Flowware we prefer the anti machine paradigm, the counterpart of von Neumann.
de facto Duality of RAM-based platforms
We now have 2 types of programmable platforms:
• traditional: RAM-based platform CPU; „running“ on it: software; machine paradigm: von Neumann etc.: instruction-stream-based
• new: RAM-based platform morphware (FPGA, rDPA ..); „running“ on it: configware; 2nd paradigm: anti machine: data-stream-based
hardware viewed as frozen configware: just earlier binding
Terminology: Digital System Platforms, clearly distinguished
platform | source „running“ on it | machine paradigm
hardware | (not programmable) | none
morphware, fine grain: rGA (FPGA) | configware | none
morphware, coarse grain: rDPU, rDPA (reconfigurable data stream processor) | flowware & configware | anti machine
data stream processor (hardwired) | flowware | anti machine
instruction stream processor | software | von Neumann machine
Importance of binding time
binding times, from late to early: run time, load time, compile time, fabrication time
platforms, from late to early binding: Microprocessor, Parallel Computer, Reconfigurable Computing, Full custom or ASIC
• Microprocessor: time of “instruction fetch”: read new instruction at run time
• Reconfigurable Computing: configure datapaths; configuration is like a kind of pre-packed, frozen-in „super instruction fetch“ for a pipe network
• Full custom or ASIC: fabricate a datapath at fabrication time
not all switching is done by Configware
Software vs Flowware and Configware
Programming source for instruction-stream-based computing (von Neumann etc.): Software
The programming source for data-stream-based computing operations (the anti machine paradigm): Flowware
Programming sources for Reconfigurable Computing (morphware): Flowware and Configware
Sources for Embedded Systems: Flowware, Configware & Software
[Figure: compile for the µProc.; d.schedule for the hdw. anti machine; compile for the rec. anti machine; partit. compiler for µProc. plus rec. anti machine]
control-procedural vs. data-procedural
TU Kaiserslautern
The structural domain is primarily data-stream-based:
Flowware
..... mostly not yet modelled that way:
most flowware is hidden by its indirect
instruction-stream-based implementation
Flowware converts „procedural vs. structural“
into „control-procedural vs. data-procedural“ ...
Flowware Languages
(Data Streaming Language examples)
Brook: for modern graphics hardware
DSP-C: allows describing key features of DSPs (www.dsp-c.org)
Streams-C: defines 1-D streams; generates VHDL (LANL open-source C compiler targeting FPGAs)
general purpose:
MoPL: fully supporting the anti machine paradigm, the counterpart of the von Neumann paradigm
programming: procedural vs. structural
instruction-stream-based
TU Kaiserslautern
embedded systems:
domain
procedural
computing in ...
time only*
data-stream-based
structural
space and time
hardwired
reconfigurable
currently emerging
program source software*
(hardware +) (hardware +) configware +
software** flowware
flowware
before fabrication at loading time
„instruction“ fetch
at runtime
data „fetch“
at run time
not programmable
fully hardwired:
algorithms fixed
resources fixed
*) only one
**) software „simulates“ flowware
source needed
reconfigurable:
CPU:
algorithms variable
algorithms variable
resources variable
resources fixed
digital system platforms: approaching consensus
Glossary
DPU: data path unit
rDPU: reconfigurable DPU
DPA: data path array (DPU array)
rDPA: reconfigurable DPA
ISP: instruction set processor
AM: anti machine
AMP: data stream processor*
rAMP: reconfigurable AMP
*) data stream processor: no “dataflow machine”
**) Von Neumann etc.

platform category | source „running“ on platform | machine paradigm
hardware | (not programmable) | none
ISP** | software | von Neumann
morphware (FPGA) | configware | none
data stream processor (AMP*) | flowware | anti machine
reconfigurable AMP (rAMP) | flowware & configware | anti machine

categories of morphware:
morphware use | granularity (path width) | (re)configurable blocks
reconfigurable logic | fine grain (FPGA) (~1 bit) | CLBs
reconfigurable computing | coarse grain (e.g. 32 bits) | rDPUs (e.g. ALU-like)
reconfigurable computing | multi granular: by slice bundling | rDPU slices (e.g. 4 bits)
TU Kaiserslautern
data streams*: not new
1980: data streams (Kung, Leiserson: systolic arrays)
1989: data-stream-based Xputer architecture
1990: rDPU (Rabaey)
1994: Flowware Language MoPL (Becker et al.)
1995: super systolic array (rDPA) + DPSS tool (Kress)
1996+: Streams-C language, SCCC (Los Alamos), SCORE,
ASPRC, Bee (UC Berkeley), DSP-C, Brook, ...
1996: configware / software partitioning compiler (Becker)
URLs: Software vs Flowware and Configware
TU Kaiserslautern
http://morphware.net/
http://configware.org/
http://flowware.net/
http://data-streams.org/
http://anti-machine.org/
http://kressarray.de/
>> Embedded Computing <<
Fine-grain Morphware: Drawbacks
• FPGA Architectures
– SRAM-based Look-up Tables (LUTs)
– Problems:
• Routing: reduces Performance
• Bad Ratio: active / passive Elements
[Figure: Configurable Logic Block (CLB) with LUT and reconfigurable Interconnect (Switching Boxes). Source: R. Hartenstein]
Reconfigurability Overhead
[Figure: chip area split into the area used by the application and the resources needed for reconfigurability, partly for configuration code storage (“hidden RAM” not shown)]
Throughput vs. Efficiency
[Figure: MOPS / mW (0.001 to 1000, logarithmic) versus feature size (2 µm down to 0.07 µm). Annotations: 1 Bit CLB; wiring by abutment: 32 Bit example. Sources: T. Claasen et al.: ISSCC 1999; *) R. Hartenstein: ISIS 1997]
One more argument for coarse grain
If coarse grain cells are full custom and mesh-connected, and the 2nd level interconnect resources are laid out over the cells, the array is almost as area-efficient as hardwired.
Wiring by abutment: a 32 Bit KressArray example
[Figure: MOPS / mW (0.001 to 1000, logarithmic) versus feature size (2 µm down to 0.07 µm). Sources: T. Claasen et al.: ISSCC 1999; *) R. Hartenstein: ISIS 1997]
It’s a Paradigm Shift !
• Using FPGAs (fine grain reconfigurable) has mainly been classical Logic Synthesis on a “strange hardware” platform
• Coarse Grain Reconfigurable Arrays (rDPAs) (Reconfigurable Computing), however, mean a really fundamental Paradigm Shift
• This is still ignored by CS and EE Curricula and almost all R&D scenes
rDPA (coarse grain) vs. FPGA (fine grain)
Status: ~1998
roughly: performance (MOPS/mW, orders of magnitude): µProc 0, DSP 1, FPGA 2, rDPA 3, hardwired 3
roughly: area efficiency (transistors/chip, orders of magnitude): µProc 0, commodity FPGA 2, rDPA 4, hardwired 4
[Figure: Mega-rGAs: system gates per rGA chip (100 to 10 000 000, logarithmic) versus year (1984 to 2004), with XC 4085XL, XC 40250XV, Virtex, Virtex II and planned devices. Source: Xilinx Data]
entire system on a single chip: all you need on board
• Xilinx Virtex-II Pro FPGA Architecture
• PowerPC 405 RISC CPU (PPC405) cores
• FPGA fabric based on Virtex-II Architecture
[Figure: Rocket IO, PowerPC Core, On Chip Memory Controller, Embedded RAM. Source: Ivo Bolsens, Xilinx]
Mask & NRE cost
[Figure; source: ST microelectronics]
>> coarse-grained Platforms <<
Computing in space and time
this dichotomy is completely ignored by our CS curricula
computing in time vs. computing in space (placement): systolic arrays etc., data streams
migration by re-timing and other transformations
[Figure: systolic array example with coefficients a11 ... a33, input data streams x1, x2, x3 and output data streams y1, y2, y3 (initial values y1(0), y2(0), y3(0))]
Flowware programs data streams
Flowware defines:
... which data item
at which time
at which port
[Figure: a DPA surrounded by input data streams and output data streams, each drawn as data items (x) over time and port #]
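The idea that flowware fixes "which data item at which time at which port" can be sketched as a plain schedule table. The following Python sketch assumes a linear array of DPUs computing a matrix-vector product, with row i of the matrix entering at port i and skewed by i time steps; the function names and this particular skewing are illustrative assumptions, not the schedule used in the slides.

```python
from typing import List, Tuple

Event = Tuple[int, int, int]  # (time step, port number, data item)


def flowware_schedule(a: List[List[int]]) -> List[Event]:
    """List every matrix element with the time step and port at which it
    enters the array: a[i][j] arrives at port i at time i + j."""
    events = [(i + j, i, a[i][j])
              for i in range(len(a))
              for j in range(len(a[i]))]
    return sorted(events, key=lambda e: (e[0], e[1]))


def run_linear_array(a: List[List[int]], x: List[int]) -> List[int]:
    """Replay the schedule: DPU `port` accumulates y[port]. The x stream is
    assumed to move one DPU per time step, so x[time - port] is present at
    that DPU when the matching matrix element arrives."""
    y = [0] * len(a)
    for time, port, item in flowware_schedule(a):
        j = time - port
        y[port] += item * x[j]
    return y
```

The DPUs themselves hold no addresses or loop control here; all sequencing information sits in the schedule.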
Mathematicians X-ing
Mathematicians like to wax rhapsodic about the elegance, beauty and depth of their proofs.
John Horgan: „The End of Science Revisited“
Systolic Synthesis: Mathematicians like the beauty and elegance of Systolic Arrays. Due to a lack of depth in understanding, their efforts yielded poor synthesis algorithms.
Reiner Hartenstein
Generalized Stream-based Computing System
heterogeneous Array of rDPUs (reconf. data path units)
The same mapper for both: reconfigurable or hardwired
Kress DPSS [1995]: Mapper, Scheduler, Configware Compiler
simultaneous placement & routing; data streams
[Figure: expression tree (operands x, y, a; operators *, +) mapped onto DPU architectures offering operators such as +, *, -, sh, xf]
Supersystolic Array Principles
TU Kaiserslautern
• take systolic array principles
• replace classical synthesis by simulated annealing
• yields the supersystolic array
• a generalization of the systolic array
• no more restricted to regular data dependencies
• now reconfigurability makes sense: use morphware
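How simulated annealing can take over the mapping step is sketched below for a toy placement problem. This is not the DPSS algorithm: the cost function (Manhattan wire length), the linear cooling schedule, the swap move and all names are assumptions chosen for a short, self-contained illustration.

```python
import math
import random
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column) on the mesh


def wire_length(placement: Dict[str, Cell], nets: List[Tuple[str, str]]) -> int:
    """Sum of Manhattan distances over all two-point connections."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)


def anneal_placement(ops: List[str], nets: List[Tuple[str, str]],
                     rows: int, cols: int, steps: int = 3000,
                     t0: float = 2.0, seed: int = 0):
    """Place operators on a rows x cols mesh by simulated annealing."""
    assert rows * cols >= len(ops)
    rng = random.Random(seed)
    slots = list(ops) + [None] * (rows * cols - len(ops))  # None = unused cell
    rng.shuffle(slots)

    def to_placement(s):
        return {op: divmod(i, cols) for i, op in enumerate(s) if op is not None}

    current = wire_length(to_placement(slots), nets)
    best, best_cost = list(slots), current
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9        # linear cooling
        i, j = rng.randrange(len(slots)), rng.randrange(len(slots))
        slots[i], slots[j] = slots[j], slots[i]        # trial move: swap two cells
        trial = wire_length(to_placement(slots), nets)
        if trial <= current or rng.random() < math.exp((current - trial) / temp):
            current = trial                            # accept the move
            if current < best_cost:
                best, best_cost = list(slots), current
        else:
            slots[i], slots[j] = slots[j], slots[i]    # reject: undo the swap
    return to_placement(best), best_cost
```

Because the search only needs a cost function, it does not depend on regular data dependencies, which is the point the slide makes about the generalization.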
flowware history
flowware history:
1980: data streams (Kung, Leiserson)
1995: super systolic rDPA (Kress)
1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...
(tutorials and courses available on all this)
[Figure: a DPA with input data streams and output data streams over time and port #]
computing paradigms and methodologies
TU Kaiserslautern
1946: machine paradigm (von Neumann)
1989: anti machine paradigm
1990: rDPU (Rabaey)
1994: anti machine high level programming language
1995: super systolic array (rDPA)
flowware*
1980: data streams (Kung, Leiserson)
1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...
1997+: discipline of distributed memory architecture
1997: configware / software partitioning compiler
Super Pipe Networks
array | applications | pipeline shape | pipeline resources | mapping | scheduling (data stream formation)
systolic array | regular data dependencies only | linear only | uniform only | linear projection or algebraic synthesis | (e.g. force-directed) scheduling algorithm
supersystolic rDPA* | no restrictions | no restrictions | no restrictions | simulated annealing or P&R algorithm | (e.g. force-directed) scheduling algorithm
*) KressArray [1995]
Programming Language Paradigms
language category: Von Neumann Languages vs. Anti Machine Languages
both deterministic: procedural sequencing: traceable, checkpointable
operation sequence driven by:
• Von Neumann Languages: read next instruction, goto (instr. addr.), jump (to instr. addr.), instr. loop, loop nesting, no parallel loops, escapes, instruction stream branching
• Anti Machine Languages: read next data item, goto (data addr.), jump (to data addr.), data loop, loop nesting, parallel loops, escapes, data stream branching
state register: program counter vs. data counter(s)
address computation: massive memory cycle overhead vs. overhead avoided
Instruction fetch: memory cycle overhead vs. overhead avoided
parallel memory bank access: interleaving only vs. no restrictions
language features: control flow + data manipulation vs. data streams only (no data manipulation)
Similar Programming Language Paradigms
language category: Computer Languages vs. Xputer Languages
both deterministic: procedural sequencing: traceable, checkpointable
sequencing driven by:
• Computer Languages: read next instruction, goto (instruction addr.), jump (to instruction addr.), instruction loop, instruction loop nesting, no parallel loops, instruction loop escapes, instruction stream branching
• Xputer Languages: read next data object, goto (data addr.), jump (to data addr.), data loop, data loop nesting, parallel data loops, data loop escapes, data stream branching
Machine Paradigms
machine category: Computer (the Machine: “v. Neumann”) vs. The Anti Machine
driven by: Instruction streams vs. data streams (no “dataflow”)
engine principles: instruction sequencing vs. sequencing data streams
state register: single program counter vs. (multiple) data counter(s)
Communication path set-up (“instruction fetch”): at run time vs. at load time
data path resource: DPU (e.g. single ALU) vs. DPU or DPA (DPU array) etc.
data path operation: sequential vs. parallel pipe network etc.
The Anti Machine: also hardwired implementations*
*) e.g. Bee project, Prof. Brodersen
mapping algorithms efficiently onto rDPA
SNN filter on KressArray
array size: 10 x 16 = 160 rDPUs
Legend: rDPU not used; rDPU used for routing only (route thru only); operator and routing; port location marker; backbus connect
by the way: example of scalability / relocatability by EDA support
also FPGA scalability (avoid routing congestion) by EDA solution
Xplorer Plot: SNN Filter Example
http://kressarray.de
2 hor. NNports, 32 bit
3 vert. NNports, 32 bit
[Figure legend: route-thru-only rDPU; operator with operands and result; route thru; backbus connect] [13]
>>> distributed memory
The new discipline of (application-specific) distributed memory
[] Herz et al.: Proc. IEEE ICECS 2002
Synthesizable distributed memory architecture for a Stream-based Soft Machine
[Figure: Compiler, Scheduler, “instructions”, rDPA, Sequencers (data stream generator) and the Memory (data memory) built from memory banks]
Synthesizable Distributed Memory
An example by Nageldinger’s KressArray Xplorer (http://kressarray.de)
Efficient Memory Communication should be directly supported by the Mapper Tools
Legend: Optimized Parallel Memory Controller; memory ports; sequencers; application; not used
Coarse Grain Architectures
project | style | first publ. | source | architecture | granularity | fabrics | mapping | intended target application
DP-FPGA | mesh | 1994 | [4] | 2-D array | 1 & 4 bit, multi-granular | inhomog. routing channels | switchbox routing | regular datapaths
KressArray | mesh | 1995 | [5,11] | 2-D mesh | family: sel. pathwidth | multiple NN & bus segments | (co-)compilation | (adaptable)
Colt | mesh | 1996 | [12] | 2-D array | 1 & 16 bit | inhomogeneous | run time reconfiguration | highly dynamic reconfig.
Matrix | mesh | 1996 | [15] | 2-D mesh | 8 bit, multi-granular | 8NN, length 4 & global lines | multi-length | general purpose
RAW | mesh | 1997 | [17] | 2-D mesh | 8 bit, multi-granular | 8NN switched connections | switchbox rout | experimental
Garp | mesh | 1997 | [16] | 2-D mesh | 2 bit | global & semi-global lines | heuristic routing | loop acceleration
REMARC | mesh | 1998 | [18] | 2-D mesh | 16 bit | NN & full length buses | (info not available) | multimedia
MorphoSys | mesh | 1999 | [19] | 2-D mesh | 16 bit | NN, length 2 & 3 global lines | manual P&R | (not disclosed)
CHESS | mesh | 1999 | [20] | hexagon | 4 bit, multi-granular | 8NN and buses | JHDL compilation | multimedia
DReAM | mesh | 2000 | [21] | 2-D array | 8 & 16 bit | NN, segmented buses | co-compilation | next generation wireless communication
CS2000 family | mesh | 2000 | [23] | 2-D array | 16 & 32 bit | inhomogeneous array | (not disclosed) | tele- & datacommunication
MECA family | mesh | 2000 | [24] | 2-D array | multi-granular | (not disclosed) | (not disclosed) | tele- & datacommunication
CALISTO | mesh | 2000 | [25] | 2-D array | 16 bit, multi-granular | (not disclosed) | (not disclosed) | tele- & datacommunication
FIPSOC | mesh | 2000 | [26] | 2-D array | 4 bit, multi-granular | (not disclosed) | (not disclosed) |
RaPID | linear | 1996 | [27] | 1-D array | 16 bit | segmented buses | channel routing | pipelining
PipeRench | linear | 1998 | [29] | 1-D array | 128 bit | (sophisticated) | scheduling | pipelining
PADDI | crossbar | 1990 | [30] | crossbar | 16 bit | central crossbar | routing | DSP
PADDI-2 | crossbar | 1993 | [32] | crossbar | 16 bit | multiple crossbar | routing | DSP and others
Pleiades | crossbar | 1997 | [33] | mesh+crossbar | multi-granular | multiple segmented crossbar | switchbox routing | multimedia
Primarily Mesh-based ….
market | project | bits granularity | source
research | KressArray | variable | U. Kaiserslautern
research | Garp | 2 | UC Berkeley
research | CHESS | 4 | Hewlett Packard
research | Matrix | 8 | M.I.T.
research | RAW | 8 | M.I.T.
research | Colt | 1 & 16 | Virginia Tech
research | DReAM | 8 & 16 | TU Darmstadt
research | REMARC | 16 | Stanford
research | MorphoSys | 16 | UC Irvine
commercial | CALISTO | 16 | Silicon Spice
commercial | MECA family | 16 | Malleable
commercial | CS2000 family | 16 & 32 | Chameleon Systems
commercial | FIPSOC | 16 & analog | SIDSA
commercial | XPP XPU128 | 32 | PACT Corp.
UC Berkeley (Jan Rabaey)
market | project | bits granularity | source
research | PADDI | 16 | UC Berkeley
research | PADDI-2 | 16 | UC Berkeley
research | Pleiades | 16 | UC Berkeley
commercial rDPA example: PACT XPP - XPU128
• Full 32 or 24 Bit Design working silicon
• 2 Configuration Hierarchies
• Evaluation Board available, and
• XDS Development Tool with Simulator
[Figure: XPP128 rDPA; each rDPU is a PAE core with ALU, Ctrl and CFG; buses not shown. © PACT AG, Munich http://pactcorp.com]
XPP64A: Platform Development Board
- SDR Board In Debug Phase -> XPP64A Chips from STMicro Fab
- Assembly & Test / Available March 2003
PACT Corp
• Xtreme Processor Platform (XPP) family of IP cores: high-speed, data-stream-capable, scalable, reconfigurable clusters of arrays of 32-bit DPUs with embedded memories and high-speed I/O ports
• Application development support software featuring a flow-graph-style algorithm mapping language, to minimize training requirements
• XPP's fabrics feature automatic DataFlow synchronization and a flagged Event Network to dynamically configure the execution flow
• Supports dynamic RTR: hierarchical configuration managers free the designer from chip-level details and ensure that configurations are independently loaded in exactly the intended order
• Automatic event-based task swapping along with data streams: released resources are automatically reconfigured immediately
Sequential Processor Model
© 2003, PACT AG
Conventional processors use the sequential model:
Each operation takes one clock cycle.
Multiple operations are computed consecutively.
[Figure: Register; Operation 1 to Operation 5 along the time axis]
© 2003, PACT AG
A New Parallel Processor Paradigm
Multiple computations are configured as code sections onto a two dimensional array.
[Figure: x/y array with Data Buffer along the time axis]
Parallel Processor Model
Multiple code sections are computed sequentially.
[Figure: Section 1, Section 2 and Section 3 on the x/y array along the time axis]
Dataflow Performance
Traditional Microprocessor:
• Instruction Memory and cache
• ALU, Register
• ADD, MULT, SHIFT
• One word
• One operation per cycle
• Basic machine operations performed on single words
XPP Architecture:
• Configuration Memory and cache
• Array of ALUs, Buffer
• Filter, FFT, Viterbi
• Stream of words
• Many operations per cycle
• Complex Functions performed on data streams
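The difference between "one operation per cycle" and "many operations per cycle" can be made concrete with a back-of-the-envelope cycle count. The sketch below assumes an ideal case that is not taken from the PACT slides: every operation takes one cycle, the array holds one ALU per operation in a linear pipeline, and there are no stalls or reconfiguration delays. The function names are hypothetical.

```python
def sequential_cycles(num_items: int, ops_per_item: int) -> int:
    """Sequential model: every operation of every data word costs one cycle."""
    return num_items * ops_per_item


def pipelined_cycles(num_items: int, ops_per_item: int) -> int:
    """Ideal pipeline over an array of ALUs: ops_per_item cycles until the
    first result appears, then one further result per cycle."""
    if num_items == 0:
        return 0
    return ops_per_item + num_items - 1
```

For a stream of 1000 words and 5 operations per word, this idealized model gives 5000 cycles for the sequential case and 1004 cycles for the pipelined array.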
instruction stream-based Compilation Principles
TU Kaiserslautern
1-D memory space
source text
parser
library
link/load
instruction call placement
scheduler
execution order by location
Datastream-based Compilation Principles
TU Kaiserslautern
library
mapper
placement
& routing
scheduler
data stream assembly
>> Dual Machine Paradigms <<
3rd machine model became mainstream
• mainframe age: compile; mainframe; instruction-stream-based
• computer age (PC age): compile, design; µProc. + accel.
• morphware age: µProc. + rDPA
[Time axis: 1957, 1967, 1977, 1987, 1997, 2007 (Makimoto's wave)]
benefit from RAM-based & 2nd paradigm
RAM-based platform needed for:
• flexibility, programmability
• avoiding the need of specific silicon
simple 2nd machine paradigm needed as a common model:
• to avoid the need of circuit expertise
• needed to educate zillions of programmers
what programming source language ?
McKinsey Curve: dynamics of R&D disciplines
[Figure: maturity of a discipline versus year: fundamental issues, consolidation, saturation (limitations met), then a new discipline on top of it by innovation]
EDA Industry Revolutions
EDA industry paradigm switching every 7 years (1978, 1985, 1992, 1999, 2006), courtesy [Keutzer / Newton]:
• Transistor entry: Applicon, Calma, CV ...
• Schematics entry: Daisy, Mentor, Valid ...
• Synthesis: Cadence, Synopsys ...
• HLLs, (Co-) Compilation; Data-Stream-based DPU arrays
coming closer to programmers‘ mind set
How to achieve acceptance
TU Kaiserslautern
how to hide the ugliness from the user [Herman Schmit]
No hardware description languages
[Courtesy Richard Newton]
Tools usable by users
not being hardware
designers
Courses tailored for
students not being
hardware-savvy
EDA tools based on term rewriting [Arvind] [Mauricio Ayala]
Your name here: your proposals
configware compiler
[Figure: source „program“ → configware compiler → anti machine]
Configware Compilation
[Figure: inside the configware compiler, the mapper (placement & routing) targets the (r)DPA and the data scheduler produces the flowware code for the surrounding asM units (memory bank with data counter); (r)DPA and asM units together form the anti machine]
symbiosis of machine models
[Figure: source „program“ → software compiler → µProc.; source „program“ → configware compiler → anti machine]
symbiosis of machine models
[Figure: source „program“ → partitioning compiler → µProc. + rDPA]
Software / Configware Co-Compilation
Juergen Becker’s CoDe-X, 1996
[Figure: High level PL source → Partitioner (with Analyzer / Profiler, driven by Resource Parameters, supporting different platforms) → SW compiler (“vN" machine paradigm) → SW code, and CW compiler (anti machine paradigm) → CW Code]
Loop Transformation Examples
sequential processes: loop 1-16 body endloop
resource parameter driven Co-Compilation (strip mining, loop unrolling):
host: loop 1-8 trigger endloop; also loop 1-4 trigger endloop and loop 1-2 trigger endloop
reconf. array: loop 1-8 body body endloop
fork: loop 1-8 body endloop and loop 9-16 body endloop; join
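Strip mining as used above can be sketched in Python. The sketch is an illustration only: `body`, `sequential`, `strip_mined` and `trigger_count` are hypothetical names, the loop count is assumed to be divisible by the strip width, and the bodies of one strip run one after another here, whereas on a reconfigurable array they would run side by side.

```python
from typing import Callable, List


def sequential(body: Callable[[int], int], n: int = 16) -> List[int]:
    """Original form: loop 1-n, body, endloop."""
    return [body(i) for i in range(1, n + 1)]


def trigger_count(n: int, width: int) -> int:
    """Number of host-side triggers after strip mining with `width` bodies."""
    return n // width


def strip_mined(body: Callable[[int], int], n: int = 16, width: int = 2) -> List[int]:
    """Strip-mined form: the host loop only issues triggers; each trigger
    starts `width` unrolled copies of the body on the array."""
    assert n % width == 0, "sketch assumes n is a multiple of width"
    results: List[int] = []
    for trigger in range(trigger_count(n, width)):      # host: loop, trigger, endloop
        base = trigger * width
        strip = [body(base + k + 1) for k in range(width)]  # array: unrolled bodies
        results.extend(strip)
    return results
```

With 16 iterations, a width of 2, 4 or 8 bodies leaves 8, 4 or 2 host triggers, which matches the loop bounds shown for the host side.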
>> final remarks <<
Jürgen Becker
Dissertation
Jürgen Becker: Professor at Univ. Karlsruhe
• ... Automatically partitioning Co-compiler
• (configware / software co-compilation)
• Resource-parameter-driven retargetable
• Profiler-driven optimization
• Accepts HLL „ALE-X“ (extended C subset)
• (subset: pointers not supported)
Rainer Kress
Dissertation
Rainer Kress: Infineon Technologies, Munich
• ... on mapping applications onto his* KressArray
• DPSS datapath synthesis system
• Including a data scheduler
• (data stream scheduler)
• Generalization of the Systolic Array
• (KressArray is a super systolic array)
• 32 bit design via Eurochip support
TU Kaiserslautern
More Presentations / Literature
• http://hartenstein.de/keynotes.html
• http://hartenstein.de/publications.html
Antimatter Search ?
in EE & CS we do not need to search
END