Coarse Grain Reconfigurable Architectures


Reconfigurable HPC
July 23, 2004, Fukuoka, Japan
Reiner Hartenstein
TU Kaiserslautern
Reconfigurable Technologies (2)
>> Machine vs. Anti Machine <<
• Machine vs. Anti Machine
• Terminology
• Morphware Platforms
• coarse-grained Platforms
• Dual Machine Paradigms
• Data Sequencing
• Final Remarks
http://www.uni-kl.de
© 2004, [email protected]
http://hartenstein.de
Reconfigurable Computing:
a second programming domain
Migration of programming to the structural domain
The structural domain has become RAM-based
The opportunity to introduce
the structural domain to programmers ...
... to bridge the gap by clever abstraction mechanisms
using a simple new machine paradigm
machine vs. anti machine (matter vs. anti matter)
[Figure: instruction stream machine (von Neumann etc.): CPU with program counter and RAM memory, vs. data stream machine (anti machine): (r)DPA without sequencer, surrounded by asM memory banks, each with its own data counter]
asM: auto-sequencing Memory
asMA: auto-sequencing Memory Array
Data-stream-based Computing
The Anti Universe of Computing
The anti universe
• Paul Dirac predicted a complete anti universe consisting of antimatter
• "There are regions in the universe which consist of antimatter ... But there are asymmetries"
• when a particle hits its antiparticle, both are converted into energy: annihilation
• We are not aware that there is a new area in computing sciences which consists of the antimatter of computing
• Reconfigurable Computing is made from this antimatter: data-stream-based computing
anti particles
• 1928: Paul Dirac: "there should be an anti electron having positive charge" (Nobel Prize 1933)
• 1932: Carl David Anderson detected this "positron" in cosmic radiation (Nobel Prize 1936)
• 1954: new accelerators: cyclotron, like Berkeley's Bevatron
• 1955: Owen Chamberlain et al. create anti proton on Bevatron
• 1956: anti neutron created on Bevatron
• 1965: creation of a deuterium anti nucleus at CERN
• 1995: hydrogen anti atom created at CERN by forcing positron and anti proton to merge at very low energy
[Figure: hydrogen and anti hydrogen]
Matter & Antimatter: Atom and Anti Atom
The World of Matter machine paradigm: the Atom
Anti Matter machine paradigm: Anti Atom
[Figure: atom and anti atom with opposite charges]
Matter & Antimatter of Informatics: Machine and Anti Machine
Machine paradigm ("von Neumann"): CPU
Anti Machine paradigm: DPU
1936: 1st electronic computer (Konrad Zuse)
1946: v. N. machine paradigm
1971: 1st microprocessor (Ted Hoff)
1979: "data streams" (systolic array: Kung / Leiserson)
1990: anti machine paradigm published
1995: rDPA / DPSS (supersystolic: Rainer Kress): novel compilation techniques
Matter vs. antimatter: CPU vs. DPU
[Figure: CPU with program counter, driven by an instruction stream, vs. DPU, (r)DPU or (r)DPA without sequencer, driven by data streams]
heavy anti atoms: DPA = DPU array
coherent data streams spinning around
[Figure: DPAs drawn as heavy anti atoms, each made of several DPUs]
Parallelism by Concurrency
independent instruction streams: difficult ...
[Figure: several independent atoms, one per instruction stream]
>> Terminology <<
Configware and Flowware ...
... are the sources for programming morphware.
Software is the source for programming traditional
hardwired processors (instruction-stream-driven:
von Neumann machine paradigm and its derivatives)
For Configware and Flowware we prefer the
anti machine paradigm, the counterpart of von Neumann
Terminology: Digital System Platforms (clearly distinguished)
platform | source running on it | machine paradigm
hardware | (not programmable) | none
morphware, fine grain: rGA (FPGA) | configware |
morphware, coarse grain: rDPU, rDPA (reconfigurable data stream processor) | flowware & configware | anti machine
data stream processor (hardwired) | flowware | anti machine
instruction stream processor | software | von Neumann machine
Importance of binding time
time of "instruction fetch":
run time: Microprocessor, Parallel Computer (read new instruction)
load time, compile time: Reconfigurable Computing (configure datapaths for a pipe network)
fabrication time: full custom or ASIC (fabricate a datapath)
Configuration: like a kind of pre-packed, frozen-in "super instruction fetch"
not all switching is done by Configware
Software vs Flowware and Configware
Programming source for instruction-stream-based computing (von Neumann etc.): Software
The programming source for data-stream-based computing operations (the anti machine paradigm): Flowware
Programming sources for Reconfigurable Computing (morphware): Flowware and Configware
Sources for Embedded Systems: Flowware, Configware & Software
[Figure: compilation flows: compile to µProc.; d.schedule to hdw. anti machine; compile to rec. anti machine; partit. compiler to µProc. plus rec. anti machine]
control-procedural vs. data-procedural
The structural domain is primarily data-stream-based: Flowware
..... mostly not yet modelled that way: most flowware is hidden by its indirect instruction-stream-based implementation
Flowware converts "procedural vs. structural" into "control-procedural vs. data-procedural" ...
data streams*: not new
1980: data streams (Kung, Leiserson: systolic arrays)
1989: data-stream-based Xputer architecture
1990: rDPU (Rabaey)
1994: Flowware Language MoPL (Becker et al.)
1995: super systolic array (rDPA) + DPSS tool (Kress)
1996+: Streams-C language, SCCC (Los Alamos), SCORE,
ASPRC, Bee (UC Berkeley), DSP-C, Brook, ...
1996: configware / software partitioning compiler (Becker)
URLs: Software vs Flowware and Configware
http://morphware.net/
http://configware.org/
http://flowware.net/
http://data-streams.org/
http://anti-machine.org/
http://kressarray.de/
HPC going configware
International Conference on Field-Programmable Logic and Applications (FPL), http://fpl.org
Aug. 30 to Sept 1, 2004, Antwerp, Belgium
288 submissions !
... going into every type of application
they all work on high performance
>> Morphware Platforms <<
Fine-grain Morphware: Drawbacks
• FPGA Architectures
– SRAM-based Look-up Tables (LUTs)
– Problems:
• Routing: reduces Performance
• Bad Ratio: active / passive Elements
[Figure: Configurable Logic Block (CLB) with LUT, surrounded by reconfigurable interconnect (switching boxes). Source: R. Hartenstein]
Reconfigurability Overhead
[Figure: FPGA fabric of logic (L) and switch (S) cells: a small area used by the application vs. resources needed for reconfigurability, partly for configuration code storage; "hidden RAM" not shown]
Throughput vs. Efficiency
[Figure: MOPS / mW (log scale, 0.001 to 1000) over feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07 µ); annotations: 1 Bit CLB; wiring by abutment: 32 Bit example; inset: area used by application vs. resources needed for reconfigurability]
T. Claasen et al.: ISSCC 1999
*) R. Hartenstein: ISIS 1997
One more argument for coarse grain
[Figure: MOPS / mW (log scale, 0.001 to 1000) over feature size (2 to 0.07 µ); wiring by abutment: a 32 Bit KressArray example]
if coarse grain cells are full custom and mesh-connected, and 2nd level interconnect resources are laid out over the cells, the array is almost as area-efficient as hardwired
T. Claasen et al.: ISSCC 1999
*) R. Hartenstein: ISIS 1997
It's a Paradigm Shift !
• Using FPGAs (fine grain reconfigurable) has mainly been just classical Logic Synthesis on a "strange hardware" platform
• Coarse Grain Reconfigurable Arrays (rDPAs) (Reconfigurable Computing), however, mean a really fundamental Paradigm Shift
• This is still ignored by CS and EE Curricula and almost all R&D scenes
Mega-rGAs
[Figure: system gates per rGA chip over the years 1984 to 2004 (log scale, 100 to 10 000 000); data points include XC 4085XL, XC 40250XV, Virtex and Virtex II (planned). Source: Xilinx Data]
entire system on a single chip
• Xilinx Virtex-II Pro FPGA Architecture
• PowerPC 405 RISC CPU (PPC405) cores
• FPGA fabric based on Virtex-II Architecture
all you need on board
[Figure: Rocket IO, PowerPC core, on-chip memory controller, embedded RAM. Source: Ivo Bolsens, Xilinx]
Mask & NRE cost
[Figure: mask and NRE cost. Source: ST microelectronics]
>> coarse-grained Platforms <<
Computing in space and time
this dichotomy is completely ignored by our CS curricula
[Figure: systolic array example with coefficients a11 ... a33, inputs x1, x2, x3 and outputs y1, y2, y3 (initial values y1(0), y2(0), y3(0)), entering as skewed data streams]
computing in time vs. computing in space (placement): systolic arrays etc., data streams
migration by re-timing and other transformations
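The migration from computing in time to computing in space can be sketched for a small matrix-vector product. This is a minimal illustration, assuming a square matrix and a linear array with one cell per output; the function names and the skewing scheme are assumptions made for this sketch, not taken from the slides.

```python
def matvec_in_time(A, x):
    """Computing in time: one multiply-accumulate after the other."""
    y = [0] * len(A)
    for i in range(len(A)):
        for j in range(len(x)):
            y[i] += A[i][j] * x[j]
    return y


def matvec_in_space(A, x):
    """Computing in space: cell i accumulates y[i]; x streams through the cells.

    At time step t, cell i sees x[t - i] together with the skewed
    coefficient A[i][t - i], so all cells work concurrently.
    """
    n = len(x)
    y = [0] * n
    pipe = [None] * n                      # x value currently held by each cell
    for t in range(2 * n - 1):
        # shift the x stream by one cell and feed the next input (or a bubble)
        pipe = [x[t] if t < n else None] + pipe[:-1]
        for i in range(n):                 # conceptually parallel
            if pipe[i] is not None:
                y[i] += A[i][t - i] * pipe[i]
    return y
```

Both functions return the same result; the second one needs 2n - 1 time steps instead of n * n sequential operations, at the cost of n cells.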
Generalized Stream-based Computing System
heterogeneous array of rDPUs (reconf. data path units)
The same mapper for both: reconfigurable or hardwired DPU architectures
Kress DPSS [1995]: Mapper (simultaneous placement & routing), Scheduler (data streams), Configware Compiler
[Figure: expression tree with operators (+, *, -, sh, xf) mapped onto an array of DPUs]
Supersystolic Array Principles
• take systolic array principles
• replace classical synthesis by simulated annealing
• yields the supersystolic array
• a generalization of the systolic array
• no longer restricted to regular data dependencies
• now reconfigurability makes sense: use morphware
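Placement by simulated annealing can be sketched as follows. This is a minimal illustration under stated assumptions: operators of a small dataflow graph are placed onto a mesh of DPU slots, the cost is the total Manhattan wire length, and the move set, cooling schedule and all names are hypothetical. It is not the DPSS implementation.

```python
import math
import random


def wire_length(place, edges):
    """Total Manhattan distance over all producer/consumer pairs."""
    return sum(abs(place[a][0] - place[b][0]) + abs(place[a][1] - place[b][1])
               for a, b in edges)


def anneal_placement(nodes, edges, rows, cols, steps=3000, t0=2.0, seed=0):
    """Place `nodes` on a rows x cols mesh by swapping two slots per move."""
    rng = random.Random(seed)
    grid = list(nodes) + [None] * (rows * cols - len(nodes))  # None = unused slot
    rng.shuffle(grid)

    def positions(g):
        return {n: divmod(i, cols) for i, n in enumerate(g) if n is not None}

    cost = wire_length(positions(grid), edges)
    initial_cost = cost
    best, best_cost = grid[:], cost
    for k in range(steps):
        temp = t0 * (1.0 - k / steps) + 1e-9        # linear cooling (assumption)
        i, j = rng.sample(range(len(grid)), 2)
        grid[i], grid[j] = grid[j], grid[i]
        new_cost = wire_length(positions(grid), edges)
        # always accept improvements, sometimes accept worse placements
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = grid[:], cost
        else:
            grid[i], grid[j] = grid[j], grid[i]      # undo the swap
    return positions(best), best_cost, initial_cost
```

Because the annealer accepts any graph, irregular data dependencies are handled the same way as regular ones, which is the point of the generalization.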
flowware history
1980: data streams (Kung, Leiserson)
1995: super systolic rDPA (Kress)
1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...
[Figure: DPA with skewed input data streams and output data streams, each plotted as port # over time]
(tutorials and courses available on all this)
computing paradigms and methodologies
1946: machine paradigm (von Neumann)
1989: anti machine paradigm
1990: rDPU (Rabaey)
1994: anti machine high level programming language
1995: super systolic array (rDPA)
flowware*
1980: data streams (Kung, Leiserson)
1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...
1997+: discipline of distributed memory architecture
1997: configware / software partitioning compiler
Super Pipe Networks
array | applications | pipeline shape | pipeline resources | mapping | scheduling (data stream formation)
systolic array | regular data dependencies only | linear only | uniform only | linear projection or algebraic synthesis | (e.g. force-directed) scheduling algorithm
supersystolic rDPA* | no restrictions | no restrictions | no restrictions | simulated annealing or P&R algorithm | (e.g. force-directed) scheduling algorithm
*) KressArray [1995]
Programming Language Paradigms
language category | Von Neumann Languages | Anti Machine Languages
both deterministic | procedural sequencing: traceable, checkpointable | procedural sequencing: traceable, checkpointable
operation sequence driven by | read next instruction, goto (instr. addr.), jump (to instr. addr.), instr. loop, loop nesting, no parallel loops, escapes, instruction stream branching | read next data item, goto (data addr.), jump (to data addr.), data loop, loop nesting, parallel loops, escapes, data stream branching
state register | program counter | data counter(s)
address computation | massive memory cycle overhead | overhead avoided
instruction fetch | memory cycle overhead | overhead avoided
parallel memory bank access | interleaving only | no restrictions
language features | control flow + data manipulation | data streams only (no data manipulation)
Similar Programming Language Paradigms
language category | Computer Languages | Xputer Languages
both deterministic | procedural sequencing: traceable, checkpointable | procedural sequencing: traceable, checkpointable
sequencing driven by | read next instruction, goto (instruction addr.), jump (to instruction addr.), instruction loop, instruction loop nesting, no parallel loops, instruction loop escapes, instruction stream branching | read next data object, goto (data addr.), jump (to data addr.), data loop, data loop nesting, parallel data loops, data loop escapes, data stream branching
JPEG zigzag scan pattern: Flowware language example (MoPL)
*> Declarations
EastScan is
  step by [1,0]
end EastScan;
SouthScan is
  step by [0,1]
end SouthScan;
NorthEastScan is
  loop 8 times until [*,1]
    step by [1,-1]
  endloop
end NorthEastScan;
SouthWestScan is
  loop 8 times until [1,*]
    step by [-1,1]
  endloop
end SouthWestScan;
HalfZigZag is
  EastScan
  loop 3 times
    SouthWestScan
    SouthScan
    NorthEastScan
    EastScan
  endloop
end HalfZigZag;
goto PixMap[1,1]
HalfZigZag;
SouthWestScan
uturn (HalfZigZag)
[Figure: PixMap with x and y axes, scanned in zigzag order by the data counter]
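The address sequence described by the MoPL example can be reproduced in a few lines of Python. This is a sketch under assumptions: "loop 8 times until [1,*]" is read as "step at most 8 times and escape once x = 1", and "uturn (HalfZigZag)" is read as replaying the steps of HalfZigZag in reverse order. Both readings are interpretations made for this sketch, not a MoPL reference semantics.

```python
E, S, NE, SW = (1, 0), (0, 1), (1, -1), (-1, 1)


def run(pos, step, times=1, until=None):
    """Apply `step` up to `times` times; escape once `until(pos)` holds."""
    out = []
    for _ in range(times):
        pos = (pos[0] + step[0], pos[1] + step[1])
        out.append(pos)
        if until is not None and until(pos):
            break
    return out


def half_zigzag(start):
    path = run(start, E)                                        # EastScan
    for _ in range(3):
        path += run(path[-1], SW, 8, lambda p: p[0] == 1)       # SouthWestScan
        path += run(path[-1], S)                                # SouthScan
        path += run(path[-1], NE, 8, lambda p: p[1] == 1)       # NorthEastScan
        path += run(path[-1], E)                                # EastScan
    return path


def zigzag_addresses():
    start = (1, 1)                                              # goto PixMap[1,1]
    first = half_zigzag(start)
    middle = run(first[-1], SW, 8, lambda p: p[0] == 1)         # SouthWestScan
    # uturn (assumed semantics): replay the HalfZigZag steps in reverse order
    pts = [start] + first
    steps = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(pts, pts[1:])]
    pos, second = middle[-1], []
    for dx, dy in reversed(steps):
        pos = (pos[0] + dx, pos[1] + dy)
        second.append(pos)
    return [start] + first + middle + second
```

Under these assumptions the generated sequence visits every location of an 8 x 8 PixMap exactly once, from [1,1] to [8,8].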
Machine Paradigms
machine category | Computer (the Machine: "v. Neumann") | The Anti Machine
driven by | instruction streams | data streams (no "dataflow")
engine principles | instruction sequencing | sequencing data streams
state register | single program counter | (multiple) data counter(s)
communication path set-up ("instruction fetch") | at run time | at load time
data path: resource | DPU (e.g. single ALU) | DPU or DPA (DPU array) etc.
data path: operation | sequential | parallel pipe network etc.
also hardwired implementations*
*) e. g. Bee project, Prof. Brodersen
mapping algorithms efficiently onto rDPA
SNN filter on KressArray
array size: 10 x 16 = 160 rDPUs
[Figure: placed and routed array. Legend: rDPU not used; used for routing only (rout thru only); operator and routing; port location marker; backbus connect]
by the way: example of scalability / relocatability by EDA support
also FPGA scalability (avoid routing congestion) by EDA solution
Xplorer Plot: SNN Filter Example
http://kressarray.de
2 hor. NNports, 32 bit
3 vert. NNports, 32 bit
[Figure: Xplorer plot [13]. Legend: operator with operands and result; route-thru-only rDPU; backbus connect]
>>> distributed memory
The new discipline of (application-specific) distributed memory
[] Herz et al.: proc. IEEE ICECS 2002
Synthesizable distributed memory architecture for a Stream-based Soft Machine
[Figure: Compiler delivers "instructions" to the rDPA; Scheduler drives the Sequencers (data stream generator); Memory (data memory) consists of several memory banks]
Synthesizable Distributed Memory
An example by Nageldinger's KressArray Xplorer (http://kressarray.de)
Efficient Memory Communication should be directly supported by the Mapper Tools
[Figure: Xplorer plot. Legend: memory ports; optimized parallel memory controller; sequencers; application; not used]
commercial rDPA example: PACT XPP - XPU128
• Full 32 or 24 Bit Design, working silicon
• 2 Configuration Hierarchies
• Evaluation Board available, and
• XDS Development Tool with Simulator
[Figure: XPP128 rDPA: array of rDPUs (PAE cores with ALU, Ctrl, CFG); buses not shown. © PACT AG, Munich, http://pactcorp.com]
XPP64A: Platform Development Board
- SDR Board In Debug Phase -> XPP64A Chips from STMicro Fab
- Assembly & Test / Available March 2003
Dataflow Performance (© 2003, PACT AG)
Traditional Microprocessor | XPP Architecture
Instruction memory and cache | Configuration memory and cache
ALU | Array of ALUs
Register | Buffer
One word | Stream of words
One operation per cycle | Many operations per cycle
Basic machine operations performed on single words (ADD, MULT, SHIFT) | Complex functions performed on data streams (Filter, FFT, Viterbi)
>> Dual Machine Paradigms <<
3rd machine model became mainstream
[Figure: timeline 1957 to 2007 (Makimoto's wave): mainframe age (compile to mainframe); instruction-stream-based computer age (PC age) (compile and design: µProc. plus accel.); morphware age (µProc. plus rDPA)]
benefit from RAM-based & 2nd paradigm
RAM-based platform needed for:
• flexibility, programmability
• avoiding the need for specific silicon
simple 2nd machine paradigm needed as a common model:
• to avoid the need for circuit expertise
• needed to educate zillions of programmers
what programming source language ?
McKinsey Curve: dynamics of R&D disciplines
[Figure: maturity of a discipline over the years: fundamental issues, consolidation, saturation (limitations met); new discipline on top of it by innovation]
EDA Industry Revolutions
EDA industry paradigm switching every 7 years (courtesy [Keutzer / Newton]), coming closer to programmers' mind set
1978: Transistor entry: Applicon, Calma, CV ...
1985: Schematics entry: Daisy, Mentor, Valid ...
1992: Synthesis: Cadence, Synopsys ...
1999, 2006: HLLs, (Co-) Compilation; Data-Stream-based DPU arrays
How to achieve acceptance
how to hide the ugliness from the user [Herman Schmit]
No hardware description languages [Courtesy Richard Newton]
Tools usable by users not being hardware designers
Courses tailored for students not being hardware-savvy
EDA tools based on term rewriting [Arvind] [Mauricio Ayala]
Your name here: your proposals
configware compiler
[Figure: source "program" -> configware compiler -> anti machine]
Configware Compilation
[Figure: source "program" -> configware compiler, consisting of mapper (placement & routing) for the (r)DPA and data scheduler producing flowware code for the data counters of the asM memory banks of the anti machine]
symbiosis of machine models
[Figure: source "program" -> software compiler -> µProc.; configware compiler -> anti machine]
[Figure: source "program" -> partitioning compiler -> µProc. and rDPA]
Software / Configware Co-Compilation: Juergen Becker's CoDe-X, 1996
[Figure: high level PL source -> Partitioner (with Analyzer / Profiler and Resource Parameters, supporting different platforms) -> SW compiler -> SW code ("vN" machine paradigm) and CW compiler -> CW code (anti machine paradigm)]
Loop Transformation Examples
resource parameter driven Co-Compilation
[Figure: a sequential process (loop 1-16: body) transformed by loop unrolling (loop 1-8 with two bodies) and by strip mining (fork; loop 1-8 and loop 9-16 in parallel on the reconf. array; join), while the host runs the corresponding trigger loops (loop 1-8, loop 1-4, loop 1-2)]
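The two transformations can be illustrated on a 16-iteration loop. This is a minimal sketch; the function names are illustrative, `body` stands for an arbitrary loop body without dependencies between iterations, and the unrolled version assumes an even iteration count.

```python
def sequential(body, n=16):
    """Original process: loop 1-16, one body per iteration."""
    return [body(i) for i in range(1, n + 1)]


def unrolled_by_two(body, n=16):
    """Loop unrolling: 8 iterations, two bodies each (n must be even)."""
    out = []
    for i in range(1, n + 1, 2):
        out.append(body(i))
        out.append(body(i + 1))
    return out


def strip_mined(body, n=16, strip=8):
    """Strip mining: an outer loop over strips, an inner loop within a strip.

    The strips are independent here, so they could be forked onto
    separate resources and joined afterwards.
    """
    out = []
    for start in range(1, n + 1, strip):
        for i in range(start, min(start + strip, n + 1)):
            out.append(body(i))
    return out
```

All three variants compute the same results; they differ only in how the iterations are grouped, which is what a resource-parameter-driven compiler can exploit.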
History of Loop Transformations
Loop Unrolling, Loop Fusion, Strip Mining ....
David Loveman, 1977, Allen and Kennedy, et al.
1970s - 80s: at Process Level:
• Sequential to Parallel Processes, incl. Vectorization
1995/97 [Karin Schmidt / Jürgen Becker]: down to Datapath Level:
• (Parameter-driven) Time to Time/Space Partitioning, e. g.: Transformation from Sequential Process to Super-systolic
2000 [Michael Herz]: optimized RA to Memory Communication Bandwidth:
• Multi-dimensional Loop Unrolling / Storage Scheme Optimization supporting burst-mode & parallel Memory Banks
>> Data Sequencing <<
application-specific distributed memory*
• Application-specific memory: rapidly growing markets:
– IP cores
– Module generators
– EDA environments
• Optimization of memory bandwidth for
application-specific distributed memory
• Power and area optimization as a further benefit
• Key issues of address generators will be discussed
*) see books by Francky Catthoor et al.
Significance of Address Generators
• Address generators have the potential to reduce
computation time significantly.
• In a grid-based design rule check a speed-up of
more than 2000 has been achieved, compared to a
VAX-11/750
• Dedicated address generators contributed a
factor of 10 - avoiding memory cycles for address
computation overhead
Smart Address Generators
1983 The Structured Memory Access (SMA) Machine
1984 The GAG (generic address generator)
1989 Application-specific Address Generator (ASAG)
1990 The slider method: GAG of the MoM-2 machine
1991 The AGU
1994 The GAG of the MoM-3 machine
1997 The Texas Instruments TMS320C54x DSP
1997 Intersil HSP45240 Address Sequencer
1999 Adopt (IMEC)
Adopt (from IMEC)
• customized MMU (cMMU)
• address expression (AE)
• Address Sequence (AS)
• Address Calculation Unit (ACU)
• Application-Specific Unit (ASU)
• cMMU synthesis environment:
• application-specific ACUs for array index reference
• ACU as a counter modified by multi-level logic filter
• ACU with ASUs from a Cathedral-3 library
• distributed ACU alleviates interconnect overhead (delay, power, area)
• nested loop minimization by algebraic transformations
• AE splitting/clustering
• AE multiplexing to obtain interleaved ASs
• other features
For more details on Adopt see paper in proceedings CD-ROM
Distributed Memory
SA: scrambling and descrambling the data ?
Just in time: a new research area:
Application-specific distributed memory:
e. g. book by F. Catthoor et al. ...
Data address generators - 20 years research:
Generic Sequence Examples
• atomic scan
• linear scan
• video scan
• -90º rotated video scan
• -45º rotated (mirx (v scan))
• sheared video scan
• non-rectangular video scan
• zigzag video scan
• spiral scan
• feed-back-driven scans
• perfect shuffle
[Figure: panels a) to g) illustrating the scan patterns; GAU with inputs L0, DA, B0 feeding Limit Slider, Address Stepper and Base Slider, producing address A]
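A few of the generic sequences can be sketched as address lists. This is a minimal illustration with 0-based (x, y) coordinates; the function names, the rotation direction and the row-major mapping to linear addresses are assumptions made for this sketch, not the GAG definitions.

```python
def linear_scan(n, start=(0, 0), step=(1, 0)):
    """n locations along one direction."""
    x, y = start
    return [(x + i * step[0], y + i * step[1]) for i in range(n)]


def video_scan(width, height):
    """Line by line, left to right: a linear scan nested in a linear scan."""
    return [p for y in range(height) for p in linear_scan(width, (0, y))]


def rotate_90(points, height):
    """Rotate a scan over a window of the given height: (x, y) -> (height-1-y, x)."""
    return [(height - 1 - y, x) for x, y in points]


def to_addresses(points, base, row_pitch):
    """Map 2-D locations to linear memory addresses (row-major, assumed)."""
    return [base + y * row_pitch + x for x, y in points]
```

For example, `to_addresses(video_scan(3, 2), base=100, row_pitch=8)` walks two rows of three words each; a data counter producing such a sequence replaces the address arithmetic an instruction stream would otherwise spend memory cycles on.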
GAU generic address unit Scheme
GAG = Generic Address Generator
[Figure: GAU with inputs L0 (Limit Slider), DA (Address Stepper) and B0 (Base Slider); the Address Stepper produces address A between base B0 and limit L]
all 3 are copies of the same BSU stepper circuit
GAG: Address Stepper
GAG = Generic Address Generator
[Figure: Address Stepper with inputs Base (B0), Limit (L), DA, stepVector, maxStepCount, init and tag; internal Step Counter, +/- unit, comparator, Escape Clause and End Detect; outputs Address (A) and endExec]
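The behaviour of such a stepper can be sketched in software. This is a hypothetical model, assuming the stepper starts at the base, adds the step width on each cycle, and ends when the address leaves the base/limit range, when the maximum step count is reached, or when an escape clause fires; the class and field names are illustrative and not taken from the GAG hardware.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class AddressStepper:
    base: int                                        # B0: first address
    limit: int                                       # L: last admissible address
    step: int                                        # DA: signed step width
    max_step_count: int                              # upper bound on steps taken
    escape: Optional[Callable[[int], bool]] = None   # optional escape clause

    def addresses(self) -> Iterator[int]:
        lo, hi = min(self.base, self.limit), max(self.base, self.limit)
        a = self.base
        for _ in range(self.max_step_count + 1):     # base plus up to N steps
            if not lo <= a <= hi:                    # end detect: limit passed
                return
            yield a
            if self.escape is not None and self.escape(a):
                return                               # escape clause fired
            a += self.step
```

Usage: `list(AddressStepper(0, 10, 3, 100).addresses())` steps from the base towards the limit without any instruction fetch or address arithmetic in the data path.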
GAG Slider Model
[Figure: Generic Address Generator (GAG) built from Limit Stepper (L0), Address Stepper (DA) and Base Stepper (B0); the limit and base steppers act as sliders for ceiling and floor, between which address A is stepped]
GAG Slider Operation Demo
[Figure: GAG with Limit Slider (L0), Address Stepper (DA) and Base Slider (B0); address A moves between floor F (base B0, step DB) and ceiling C (limit L0, step DL) in x-y address space]
GAG Complex Sequencer Implementation
[Figure: several GAUs (Generic Addressing Units), each consisting of Limit Slider, Address Stepper and Base Slider (inputs L0, DA, B0; output A), combined into GAGs; VLIW stack; SDS]
all published in 1990
Speedup by MoM
grid-based design rule check example: speed-up: >1000
complex boolean expressions in 1 clock cycle
address computation overhead: 94 %
MoM architecture: 2-D memory space, adj. scan window (example: 4x4 scan window)
[Figure: MoM anti machine: (r)DPU with smart memory interface, connected to asMA distributed memory (asM memory banks, each with a data counter)]
Xputer Lab at Kaiserslautern: MoM I and II
Antimachine: MoM architecture
[Figure: Handle Position Generator (x-GAG, y-GAG) produces handle positions following the scan pattern (high level sequencing); Scan Window Generator performs the intra scan window accesses (low level sequencing) as memory accesses to banks 0, 1, ..., n; example scan window in the x-y memory space]
Vary-size scan windows
Size adjustable at run time
square or rectangular shape
location's individual access mode: R, W, R/W, no-op
by no-op placements any wild window shape
avoid multiple read/multiple write for overlapping successive scan window positions
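The benefit of not re-reading overlapping scan window positions can be estimated with a small count. This is a sketch under assumptions: a square window slides one location at a time along a single scan line, and with reuse the previous window contents stay available in the window buffer; the function name is hypothetical.

```python
def count_reads(width, win, reuse):
    """Memory reads for a win x win window sliding along a line of `width` locations."""
    held = set()        # locations currently held in the window buffer
    reads = 0
    for x0 in range(width - win + 1):
        window = {(x, y) for x in range(x0, x0 + win) for y in range(win)}
        needed = window - held if reuse else window
        reads += len(needed)
        held = window
    return reads
```

For a 3 x 3 window over a line of 22 locations, reading every window in full costs 180 reads, while reading only the newly entered column costs 66.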
Linear Filter Application
[Figure: scan window access modes (r, w, r/w) per scan step, with data distributed over memory banks a and b]
Scanline unrolling
[Figure: scan window access modes (r, r/w) after scan line unrolling]
90º Rotation of Scan Pattern
[Figure: scan window access modes (r, w, r/w) over memory banks a and b after rotating the scan pattern by 90º; scan window overlap area marked]
Linear Filter Application
Parallelized Merged Buffer Linear Filter Application with example image of x=22 by y=11 pixel
[Figure: comparison of design stages: initial design; hardw. level access optim.; after scan line unrolling; after inner scan line loop unrolling; final design]
Multiple Scan Windows
[Figure: MoM anti machine, an Xputer architecture: rDPU with smart memory interface and multiple scan windows (example: 4x4 scan windows) on asMA distributed memory (asM memory banks with data counters)]
16 point CGFFT: mapped onto 2-D memory space
CGFFT: Nested and Parallel Scan Pattern
[Figure: 2-D memory map with input, coeff., temp and output regions; a MAC unit reads in_i with coeff. and in_i+1 with empty]
CGFFT: Parallel Scan Pattern Animation
[Figure: animation frames: MAC units with inputs in_i to in_i+3, coeff. and empty locations, and outputs out_j, out_j+1, out_k, out_k+1; 4 MAC units in parallel, then 8 MAC units in parallel]
CGFFT: Nested and Parallel Scan Pattern
outer loop scan pattern: HLScan is 3 steps [2, 0]
goto
inner loop: compound scan patterns, 3 in parallel:
SP1 is 7 steps [0, 2]
SP23 is 7 steps [0, 1]
>> final remarks <<
Jürgen Becker
Dissertation
Jürgen Becker: Professor at Univ. Karlsruhe
• Resource-parameter-driven retargetable, automatically partitioning co-compiler (configware / software co-compilation)
• Profiler-driven optimization
• Accepts HLL "ALE-X" (extended C subset; pointers not supported)
Rainer Kress
Dissertation
Rainer Kress: Infineon Technologies, Munich
• ... on mapping applications onto his* KressArray
• DPSS datapath synthesis system
• Including a data scheduler (data stream scheduler)
• Generalization of the Systolic Array (KressArray is a super systolic array)
• 32 bit design via Eurochip support
Ulrich Nageldinger
Dissertation
Ulrich Nageldinger: Infineon Technologies, Munich
• Coarse-grained Reconfigurable Architectures Design Space Exploration; Dissertation, 2001
Michael Herz
Dissertation
Michael Herz: Agilent, Sindelfingen, Germany
• High Performance Memory Communication Architectures for Coarse-grained Reconfigurable Computing Systems; Dissertation 2001
More Presentations / Literature
• http://hartenstein.de/keynotes.html
• http://hartenstein.de/publications.html
Antimatter Search ?
in EE & CS we do not need to search
END