Coarse Grain Reconfigurable Architectures


Transcript: Coarse Grain Reconfigurable Architectures

Prof. Ahmed Hemani,
Systems Architecture and Methodology,
Electronic, Computer and Software Systems,
School of Information and Communication Technology,
KTH - Royal Institute of Technology,
Stockholm, 11 September 2009
Reiner Hartenstein, TU Kaiserslautern:
Programmer Education for the Multicore Era: the Twin-Paradigm Approach
>> Outline <<
• The single-paradigm dictatorship
• Von Neumann vs. FPGA
• The Datastream Machine Model
• Avoiding address computation overhead
• The twin-paradigm approach
• Conclusions
The spirit of the Mainframe Age
• For decades, we've trained programmers to think sequentially, breaking complex parallelism down into atomic instruction steps ...
• ... finally ending up with code sizes of astronomic dimensions
• Even in "hardware" courses (the unloved child of CS scenes) we mostly teach von Neumann machine design - deepening this tunnel view
• 1951: hardware design going von Neumann (microprogramming)
Program Performance
"Multicore computers shift the burden of software performance from chip designers to programmers." [J. Larus: Spending Moore's Dividend; CACM, May 2009]
... performance drops & other problems in moving from single-core to multicore ...
The law of Moore? No, the law of More!
Massively decreasing programmer productivity: in supercomputing, a multi-disciplinary, multi-location team works for several years - software ready, hardware obsolete.
Missing programmer population and methodology: a scenario like before the Mead-&-Conway revolution.
Why we went multicore
Four walls:
• Instruction-level parallelism
• Memory
• Power
• Complexity
Multicores promised to remove the walls
Thousands of cores; boy, so many challenges…
More cores instead of faster cores
Avoiding the decline from growth industry to replacement business?
No - not without a redefinition of the field, and not under the single-paradigm dictatorship:
• sequential-only mind set dominating
• parallel algorithms mostly missing
• very difficult to program
• useful abstractions mostly missing
Very difficult to program
Programming skills needed go far beyond sequential programming:
• massive synchronization overhead
• race conditions: a source of bugs
• non-determinism: new types of bugs
• language and tool support missing
useful abstractions mostly missing
Parallelism models are machine-specific and low-level: shared memory use or message passing (hardware features).
Parallel programming happens at assembly-language level: multi-threading, semaphores, locking (compare & swap).
Performance models are machine-specific.
The result: problems in portability, investment reuse and economies of scale.
Which programming model to use?
Stubborn consensus on the von Neumann paradigm: its enforced monopoly-like dominance is the key problem - and it is incredibly inefficient (the von Neumann syndrome).
No consensus on a parallelism model: data parallelism, message passing, (unstructured) multi-threading, or ...? Many applications use all three or even more.
Language & tool support is needed to integrate the models.
Unqualified programmer population: an education reform is needed.
Program Performance
"Multicore computers shift the burden of software performance from chip designers to programmers." [J. Larus: Spending Moore's Dividend; CACM, May 2009]
... performance drops & other problems in moving from single-core to multicore ...
Since people have to write code differently, we need a Software Education Revolution anyway - the chance to move RC* from niche to mainstream.
Missing programmer population and methodology: a scenario like before the Mead-&-Conway revolution.
*) RC = Reconfigurable Computing
Power Consumption of Computers
... has become an industry-wide issue: incremental improvements are on track (plain Green Computing), but "we may ultimately need revolutionary new solutions" [Horst Simon, LBNL, Berkeley].
Energy cost may overtake IT equipment cost in the near future [Albert Zomaya]; current trends will lead to unaffordable future operation cost of our cyber infrastructure.
Twin-paradigm Green Computing: more effective by orders of magnitude.
For a Booming Multicore Era
[Chart: relative performance vs. year (1994-2030), marking the end of the single-core era.]
von-Neumann-only is not the silver bullet:
Reconfigurable Computing is indispensable!
From CPU to RPU (RPU = Reconfigurable Processing Unit)

machine model    | resources: property / programming source       | sequencer: property / programming source
ASIC accelerator | hardwired / -                                  | hardwired / -
CPU              | hardwired / -                                  | programmable / Software (instruction streams)
RPU              | programmable / Configware (configuration code) | programmable / Flowware (data streams)

State register: the CPU has a program counter; the (non-von-Neumann) RPU has data counters instead. Right now accelerators become programmable, too: we need two more program sources.
A Multicore Submarine Model?
Mapping parallelism just into the time domain - "abstracting" away the space domain - is fatal. C is not the silver bullet: it's inherently serial. There is no easy way to program in parallel, but nobody wants to learn a new language.
The programmer needs to understand how data flows through cores, accelerators, interconnect and peripherals. The datastream model of the twin-paradigm approach helps to understand the space domain and parallelism. The programmer* needs system visualization in the space domain to understand performance under parallelism.
*) and especially the student
>> Outline <<
• The single-paradigm dictatorship
• Von Neumann vs. FPGA
• The Datastream Machine Model
• Avoiding address computation overhead
• The twin-paradigm approach
• Conclusions
The first Reconfigurable Computer
• prototyped 1884 by Herman Hollerith
• a century before FPGA introduction
• data-stream-based
• 60 years later the instruction-stream-based von Neumann (vN) model took over
[Figure: the LUT (lookup table).]
widening the semantic gap [Harold "Bud" Lawson]
Unnecessary complexity inside:
• Burroughs B5000/5500: language-friendly stack machine
• IBM 360/370 & Intel x86: highly complex instruction sets
• MULTICS (GE, Honeywell): well manageable (implemented in PL/1)
• UNIX: complexity problems, compatibility problems
• Pascal killed by C, coming as an infection along with UNIX
• KARL killed by VHDL, an infection coming along with Ada
Languages turned into Religions
"Java is a religion, not a language" [Yale Patt]
• teaching students the tunnel view of language designers
• falling in love with the subtleties of formalisms
• instead of meeting the needs of the user
It is alarming [Fred Brooks]
The transition from machine level to higher-level languages led to the biggest productivity gain ever made. It's alarming that today's megabytes of code are compiled from languages at low abstraction levels (C, C++, Java).
Mastering even small complexity creates a deep feeling of satisfaction without solving the real problem: it appeals to people who do not know what they are doing.
the catastrophe gap [Harold "Bud" Lawson]
[Chart: complexity vs. year, showing the catastrophe gap.]
Migration to FPGAs: the silver bullet?
One of the reasons for the gap: the von Neumann syndrome.
Speed-up factors by GPGPUs (1)
[Michael Garland, NVIDIA Research: Parallel Computing on Manycore GPUs; IPDPS, Rome, Italy, June 25-29, 2009]
GPUs can be used only in certain ways, and their power efficiency is disputable:
• effective only at problems that can be solved using stream processing
• the programmer has to learn irrelevant graphics concepts
• data copy from main memory to video memory is slow
Such speed-ups by GPGPUs hold only for embarrassingly parallel applications.
[Chart: speed-up factors* (log scale, up to ~150x), Jan 2007 - Jan 2010, for Numerics, Imaging, Bioinformatics and Video applications.]
*) migration from x86 single-core
Speed-up factors by GPGPUs (2)

NVIDIA GeForce GTX | stream processor cores | minimum recommended power supply
275                | 240                    | 650-680 watt
295                | 480                    | 650-680 watt

(For comparison: Intel Core(TM)2 Quad for desktop PCs: 4 cores; Intel Xeon "Nehalem-EX" for servers: 8 cores.)
The Compute Unified Device Architecture (CUDA) accelerates BLAS libraries (Basic Linear Algebra Subroutines) - less flexible than FPGAs. (GPGPU tool development started years earlier than for x86.)
CUDA ZONE pages [NVIDIA Corp.]: non-reviewed CUDA user submissions -
http://www.nvidia.co.uk/object/cuda_home_uk.html#state=home
[Chart: speed-up factors* (log scale, up to ~600x), Jan 2007 - Jan 2010, for Astrophysics, Bioinformatics, CFD (Computational Fluid Dynamics), Cryptography, DCC (Digital Content Creation), DSP (Digital Signal Processing), EDA, Graphics, Imaging, Numerics, oil & gas, and Video & Audio applications.]
*) migration from x86 single-core
Speed-up factors obtained by software-to-configware migration (up to ~30,000x)
[Chart: speed-up factors (log scale) obtained by FPGA - e.g. DES breaking 28,514x and DNA & protein sequencing 8,723x - across image processing, pattern matching, multimedia DSP and wireless, real-time face detection, SPIHT wavelet-based image compression, Reed-Solomon decoding, video-rate stereo vision, pattern recognition, BLAST, FFT, protein identification, crypto, CT imaging, Viterbi decoding, Smith-Waterman pattern matching, molecular dynamics simulation, bioinformatics and astrophysics (GRAPE). Versus GPU (CUDA ZONE, Garland IPDPS'09: up to ~200x): almost 50x, i.e. two orders of magnitude.]
"... design techniques will evolve, by necessity, to satisfy the demands of reconfigurable hardware and software programmability." [J. R. Rattner, DAC 2008]
Intel supports direct front side bus access by FPGAs.
Software vs. FPGA (2): Massive Energy Saving
Energy saving factors: roughly 10% of the speed-up factor.
[Chart: the same FPGA speed-up data as on the previous slide.]
RC*: Demonstrating the intensive Impact
[Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008]
SGI Altix 4700 with RC 100 RASC, compared to a Beowulf cluster:

Application                | Speed-up factor | Power savings | Cost | Size
DNA and protein sequencing | 8723            | 779           | 22   | 253
DES breaking               | 28514           | 3439          | 96   | 1116

Much less memory and bandwidth needed - massively saving energy; much less equipment needed.
*) RC = Reconfigurable Computing
Why such Speed-up Factors ...
... with FPGAs - a much worse technology!
massive wiring overhead
+ massive reconfigurability overhead
+ routing congestion growing with FPGA size
The "Reconfigurable Computing Paradox" - main reason: no von Neumann syndrome!
More recently also: more "platform FPGAs".
>> Outline <<
• The single-paradigm dictatorship
• Von Neumann vs. FPGA
• The Datastream Machine Model
• Avoiding address computation overhead
• The twin-paradigm approach
• Conclusions
Reconfigurability per se is not the key
It's the paradigm coming along with it.
Note: no instruction fetch at run time!
Data streams instead of instruction streams.
Enabling technology for data sequencers brings further performance improvements.
A non-reconfigurable example is the BEE project (Bob Brodersen et al., UC Berkeley).
"data stream": an ambiguous definition
Reconfigurable Computing is not instruction-stream-based - it's data-stream-based.
It's different from the operation of the (indeterministic) "dataflow machine"; other definitions also come from the multimedia area. A usable definition comes from the systolic array area.
introducing Data streams [H. T. Kung et al., 1979, 1980]
[Diagram: skewed input data streams entering and output data streams leaving a DPA (pipe network), port by port over time; execution is transport-triggered. No memory wall inside the systolic array.]
Flowware defines: which data item appears at which time at which port.
Classic Systolic Array Synthesis
Algebraic methods, i.e. linear projections, yield only uniform arrays with linear pipes - only for applications with regular data dependencies.
data counters: the counterpart of the von Neumann machine
[Diagram: a coarse-grained (r)DPA fed on all sides by ASMs (ASM = auto-sequencing memory: GAG + RAM + data counter). Data counters instead of a program counter: distributed memory, with the sequencers located at the memory, not at the data path.]
Who generates the Data Streams?
Why the SA scene missed inventing the new machine paradigm: without a data sequencer it's not a machine!
The reductionist approach: "it's not our job" (it's not algebraic).
[Diagram: data streams entering and leaving the array.]
Algebraic Synthesis Methods

property                           | systolic array                           | super-systolic rDPA*
pipeline shape                     | linear only                              | no restrictions
pipeline resources                 | uniform only                             | no restrictions
applications                       | regular data dependencies only           | no restrictions
mapping                            | linear projection or algebraic synthesis | simulated annealing or P&R algorithm
scheduling (data stream formation) | -                                        | (e.g. force-directed) scheduling algorithm

*) KressArray [1995]
http://hartenstein.de
[email protected]
Generalization of the Systolic Array
....
[Rainer Kress]
discard algebraic synthesis methods
use optimization algorithms instead,
for example: simulated annealing
flowware history:
1980: data streams
(Kung, Leiserson)
1995: super systolic rDPA
(Rainer Kress)
the achievement: also non-linear
and non-uniform pipes, and even
more wild pipe structures possible
1996+: SCCC (LANL), SCORE,
ASPRC, Bee (UCB),
now reconfigurability really makes sense
© 2009, [email protected]
35
http://hartenstein.de
http://hartenstein.de
[email protected]
KressArray principles
• take systolic array principles
• replace classical synthesis by simulated annealing
• this yields the super-systolic array: a generalization of the systolic array
• no longer restricted to regular data dependencies
• now reconfigurability makes sense
Super-systolic Synthesis
(The same comparison as under "Algebraic Synthesis Methods" above: the super-systolic rDPA* removes the systolic array's restrictions on pipeline shape, resources and applications by replacing linear projection / algebraic synthesis with simulated annealing or a P&R algorithm, plus an (e.g. force-directed) scheduling algorithm for data stream formation.)
*) KressArray [1995]
Coarse-grained Reconfigurable Array example
Image processing: the SNN filter (mainly a pipe network), compiled by Nageldinger's KressArray Xplorer (Juergen Becker's CoDe-X inside).
[Diagram: a mesh-connected rDPA, array size 10 x 16 = 160 rDPUs, fed by ASMs and fast on-chip RAM; data streams 32 bits wide, plus pipelining. Legend: rDPU not used / used for routing only (route-through), operator and routing, port location marker, backbus connect.]
Note: this is kind of a software perspective, but without instruction streams - coming close to the programmer's mind set (much closer than FPGA).
hypothetical branching example to illustrate software-to-configware migration

S = R + (if C then A else B endif);

Section of a major pipe network on the rDPU: C drives a multiplexer selecting A or B into an adder with R; at a 200 MHz clock, one result S per pipeline cycle - 5 nanoseconds total.
Simple conservative CPU example (memory cycle: 100 ns): the statement compiles to three instructions ("if C then read A", "if not C then read B", "add & store"), each needing read instruction, instruction decoding, and operate & register transfers; together with the operand read* and the result store this costs 5 memory cycles = 500 nanoseconds - a factor of 100 versus the 5 ns pipe network.
*) if no intermediate storage in register file
The wrong mind set ....

S = R + (if C then A else B endif);

(a section of a very large pipe network: R, B, A and C feed an adder via a multiplexer)
"But you can't implement decisions!" - an embarrassing remark from someone not knowing this solution: a symptom of the hardware / software chasm and of the configware / software chasm.
introducing hardware description languages (in the mid-seventies)
"The decision box becomes a (de)multiplexer."
This is so simple: why did it take decades to find out?
The wrong mind set - the wrong road map!
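A minimal C sketch (my illustration, not from the slides) of exactly this mapping: the same expression once as instruction-stream branching, once in the structural view where the decision box is a 2:1 multiplexer feeding the adder.

    #include <stdint.h>

    /* Software view: instruction-stream branching (sequential). */
    static uint32_t s_branching(uint32_t r, uint32_t a, uint32_t b, int c) {
        uint32_t operand;
        if (c)                /* decision box as conditional control flow */
            operand = a;
        else
            operand = b;
        return r + operand;
    }

    /* Structural view: the decision box becomes a 2:1 multiplexer
     * feeding an adder - one pipeline stage, no instruction fetch. */
    static uint32_t s_mux(uint32_t r, uint32_t a, uint32_t b, int c) {
        uint32_t mux_out = c ? a : b;   /* (de)multiplexer */
        return r + mux_out;             /* adder           */
    }

    int main(void) {
        return s_branching(1, 2, 3, 1) == s_mux(1, 2, 3, 1) ? 0 : 1;
    }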
Xplorer Plot: SNN Filter Example [13] (http://kressarray.de)
[Plot: rDPUs with 2 horizontal and 3 vertical nearest-neighbour ports, 32 bit each; legend: operator, operand, result, route-thru-only rDPU, backbus connect.]
Programming Language Paradigms
(both deterministic; procedural sequencing: traceable, checkpointable)

Computer languages (von Neumann machine):
• operation sequence driven by a state register: the program counter
• read next instruction, goto (instr. addr.), jump (to instr. addr.), instruction loop, loop nesting, no parallel loops, escapes, instruction stream branching
• address computation: massive memory cycle overhead
• instruction fetch: memory cycle overhead
• parallel memory bank access: interleaving only
• language features: control flow + data manipulation

Languages for the anti machine:
• operation sequence driven by state registers: the data counter(s)
• read next data item, goto (data addr.), jump (to data addr.), data loop, loop nesting, parallel loops, escapes, data stream branching
• address computation: overhead avoided
• instruction fetch: none - overhead avoided
• parallel memory bank access: no restrictions
• language features: data streams only (no data manipulation)
Double Dichotomy
1) Paradigm dichotomy: von Neumann machine - instruction stream (Software domain); datastream machine - data stream (Flowware domain).
2) Relativity dichotomy: time - procedure (Software domain); space - structure (Configware domain).
Relativity Dichotomy (time vs. space, or time/space)
Time domain = procedure domain: the von Neumann machine has 2 phases: 1) programming (instruction streams), 2) run time.
Space domain = structure domain: the datastream machine has 3 phases: 1) reconfiguration of structures, 2) programming the data streams, 3) run time.
time-iterative to space-iterative
A time-to-space mapping turns n time steps on 1 CPU into 1 time step on n DPUs. Often the space dimension is limited: then a time-to-space/time mapping turns n*k time steps on 1 CPU into n time steps on k DPUs - strip mining [D. Loveman, J-ACM, 1977].
Loop transformation methodology: the seventies and later; example: bubble sort migration.
time to space mapping
Time domain (procedure domain): a time algorithm - a program loop, n time steps on 1 CPU.
Space domain (structure domain): a space algorithm - a pipeline, 1 time step on n DPUs.
Bubble sort is the time algorithm: n x k time steps on one "conditional swap" unit.
Shuffle sort is the space/time algorithm: k time steps on n "conditional swap" units forming a pipeline, each stage conditionally swapping its pair x, y (see the C sketch below).
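The migration can be sketched in C (an illustrative model, not actual configware): the time algorithm reuses one conditional-swap unit, while the space/time variant models a whole row of swap units per time step; odd-even transposition is assumed here as the shuffle-sort-like phase pattern.

    #include <stdio.h>

    static void cond_swap(int *x, int *y) {  /* one "conditional swap" unit */
        if (*x > *y) { int t = *x; *x = *y; *y = t; }
    }

    /* Time algorithm: one swap unit reused over n*k time steps (bubble sort). */
    static void sort_time_iterative(int a[], int n) {
        for (int pass = 0; pass < n - 1; pass++)
            for (int i = 0; i + 1 < n; i++)
                cond_swap(&a[i], &a[i + 1]);
    }

    /* Space/time algorithm: per time step, a row of swap units works in
     * parallel (modeled by the inner loop); n phases suffice. */
    static void sort_space_iterative(int a[], int n) {
        for (int phase = 0; phase < n; phase++)         /* k time steps      */
            for (int i = phase % 2; i + 1 < n; i += 2)  /* parallel in space */
                cond_swap(&a[i], &a[i + 1]);
    }

    int main(void) {
        int a[] = {5, 1, 4, 2, 3}, b[] = {5, 1, 4, 2, 3};
        sort_time_iterative(a, 5);
        sort_space_iterative(b, 5);
        for (int i = 0; i < 5; i++) printf("%d %d\n", a[i], b[i]);
        return 0;
    }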
Loop Transformation Examples (resource-parameter-driven co-compilation)
Sequential process:
  loop 1-16
    body
  endloop
Strip mining and loop unrolling into parallel processes (see the C sketch below):
  fork
    loop 1-8     loop 9-16
      body         body
    endloop      endloop
  join
Partitioning between host and reconfigurable array: the host keeps a trigger loop (loop 1-8, loop 1-4, or loop 1-2: trigger; endloop) while the reconfigurable array executes the unrolled loop bodies.
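A minimal C sketch of the strip-mining step (illustrative only; the two strips are written sequentially here, C being inherently serial, whereas the co-compiled version would run them as parallel processes):

    #include <stdio.h>

    static void body(int i) { printf("%d ", i); }   /* stand-in loop body */

    /* Original sequential process: loop 1-16 on one resource. */
    static void sequential(void) {
        for (int i = 1; i <= 16; i++) body(i);
    }

    /* After strip mining / loop unrolling: two strips that the
     * co-compiler would launch as parallel processes (fork ... join). */
    static void strip_mined(void) {
        /* fork */
        for (int i = 1; i <= 8; i++)  body(i);   /* strip: loop 1-8  */
        for (int i = 9; i <= 16; i++) body(i);   /* strip: loop 9-16 */
        /* join */
    }

    int main(void) {
        sequential();  printf("\n");
        strip_mined(); printf("\n");
        return 0;
    }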
MPSoC Programming model: Flowware
[Diagram: StreamIt FM-radio example (Source: MIT StreamIt): FMDemod -> Split -> LPF1/LPF2/LPF3, HPF1/HPF2/HPF3 -> Gather -> Adder -> Speaker.]
• Pros for streaming [Pierre Paulin]: streamlined, low-overhead communication; (more) deterministic behaviour; a good match for many simple media-rich applications.
• Cons: control-dominated applications; the shunt yard problem.
We have to find out which application types and programming models students should exercise for the flowware approach.
The new paradigm: how the data are traveling
No, not by instruction execution - transport-triggered (an old hat):
• pipeline, or chaining
• asynchronous (via handshake)
• systolic array
• wavefront array
How the data are moved
• DMA, vN move processor [Jack Lipovski, EUROMICRO, Nice, 1975]
• ASMs use the GAG generic address generator [TU-KL publications: Tokyo 1989 + NH journal] (by the way: the GAG was later patented by TI [TI patent 1995])
• Henk Corporaal coins the term "transport-triggered"
• MoM: GAG-based storage scheme methodology [Herz*]
• application-specific distributed memory [Catthoor et al.]
*) [see Michael Herz et al.: ICECS 2002 (Dubrovnik)]
The Paradigm Shift to Data-Stream-Based
The method of communication and data transport: by Software - the von Neumann syndrome - or by Configware - a complex pipe network on the rDPA.
Illustrating the von Neumann paradigm trap: the watering pot model [Hartenstein]
The instruction-stream-based approach suffers from the von Neumann bottleneck; the data-stream-based approach (many watering pots) has no von Neumann bottleneck.
Flowware language example (MoPL): programming the data stream
The JPEG zigzag scan pattern (an animation: data counters walk an 8x8 PixMap, x = 1..8, y = 1..8).

*> Declarations ...

EastScan is
  step by [1,0]
end EastScan;

SouthScan is
  step by [0,1]
end SouthScan;

NorthEastScan is
  loop 6 times until [*,1]
    step by [1,-1]
  endloop
end NorthEastScan;

SouthWestScan is
  loop 7 times until [1,*]
    step by [-1,1]
  endloop
end SouthWestScan;

HalfZigZag is
  EastScan
  loop 3 times
    SouthWestScan
    SouthScan
    NorthEastScan
    EastScan
  endloop
end HalfZigZag;

Main program:
  goto PixMap[1,1]
  HalfZigZag
  SouthWestScan
  uturn (HalfZigZag)
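Assuming the MoPL reconstruction above, the same scan can be modeled in C: data counters (x, y) stepped by the scan primitives (a sketch; the mirrored uturn half is omitted):

    #include <stdio.h>

    /* Data counters walking the 8x8 PixMap; x, y are 1-based
     * like the slide's [x,y] steps. */
    static int x = 1, y = 1;

    static void visit(void) { printf("[%d,%d] ", x, y); }
    static void step(int dx, int dy) { x += dx; y += dy; visit(); }

    static void east_scan(void)  { step(1, 0); }
    static void south_scan(void) { step(0, 1); }
    static void north_east_scan(void) {        /* loop 6 times until [*,1] */
        for (int i = 0; i < 6 && y > 1; i++) step(1, -1);
    }
    static void south_west_scan(void) {        /* loop 7 times until [1,*] */
        for (int i = 0; i < 7 && x > 1; i++) step(-1, 1);
    }

    static void half_zigzag(void) {
        east_scan();
        for (int i = 0; i < 3; i++) {
            south_west_scan(); south_scan();
            north_east_scan(); east_scan();
        }
    }

    int main(void) {
        visit();            /* goto PixMap[1,1]                       */
        half_zigzag();      /* first half, ending at [8,1]            */
        south_west_scan();  /* the full-length anti-diagonal to [1,8] */
        /* uturn(HalfZigZag): the mirrored second half, analogous */
        printf("\n");
        return 0;
    }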
>> Outline <<
• The single-paradigm dictatorship
• Von Neumann vs. FPGA
• The Datastream Machine Model
• Avoiding address computation overhead
• The twin-paradigm approach
• Conclusions
Significance of Address Generators
• Address generators have the potential to reduce computation time significantly.
• In a grid-based design rule check, a speed-up of more than 2000 has been achieved, compared to a VAX-11/750.
• Dedicated address generators contributed a factor of 10 - avoiding memory cycles for address computation overhead.
Generic Address Generator GAG
A generalization of the DMA. [GAG & enabling technology published 1989; survey: M. Herz et al.: IEEE ICECS 2002, Dubrovnik; patented by TI 1995.]
Acceleration factors by:
• address computation without memory cycles
• storage scheme optimization methodology, etc.
[Diagram: GAG with data counter.]
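A minimal C model of the GAG idea (names and structure hypothetical, not the MoM hardware): scan-pattern parameters live in a small register state, and a data counter steps through the pattern, so the running address costs no memory cycles for address computation.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical model of a generic address generator (GAG). */
    typedef struct {
        uint32_t base;        /* base address of the 2-D array   */
        uint32_t row_stride;  /* words per row                   */
        int x, y;             /* data counter (current position) */
        int x0, x1;           /* scan window limits in x         */
    } gag_t;

    /* One step of a video scan (line by line): emit, then advance. */
    static uint32_t gag_step_video_scan(gag_t *g) {
        uint32_t addr = g->base + (uint32_t)g->y * g->row_stride + (uint32_t)g->x;
        if (++g->x > g->x1) { g->x = g->x0; ++g->y; }  /* wrap to next line */
        return addr;
    }

    int main(void) {
        gag_t g = { .base = 0x1000, .row_stride = 8,
                    .x = 0, .y = 0, .x0 = 0, .x1 = 7 };
        for (int i = 0; i < 16; i++)   /* the first two scan lines */
            printf("0x%04x ", (unsigned)gag_step_video_scan(&g));
        printf("\n");
        return 0;
    }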
ASM: Auto-Sequencing Memory
A generalization of the DMA. Each ASM couples a GAG (generic address generator) with its data counter and a RAM block: data counters instead of a program counter.
Acceleration factors: the generic address generator computes addresses without memory cycles; acceleration also by storage scheme optimization methodology, etc. - partly explaining the RC paradox.
GAG & enabling technology: published 1989; survey: [M. Herz et al.: IEEE ICECS 2002, Dubrovnik]; patented by TI 1995.
Migration benefit by on-chip RAM
Some RC chips have hundreds of on-chip RAM blocks, orders of magnitude faster than off-chip RAM, so that the drastic code size reduction by software-to-configware migration can beat the memory wall. Multiple on-chip RAM blocks are the enabling technology for ultra-fast anti machine solutions.
GAGs inside ASMs generate the data streams.
[Diagram: an rDPA (3 x 3 block of rDPUs) surrounded by ASMs; each ASM = GAG + RAM + data counter.]
Legend: GAG = generic address generator; rDPA = rDPU array, i.e. coarse-grained; rDPU = reconfigurable datapath unit (no program counter); ASM = auto-sequencing memory.
Acceleration Mechanisms
(one of them is sketched below)
• parallelism by multi-bank memory architecture
• auxiliary hardware for address calculation
• address calculation before run time
• avoiding multiple accesses to the same data
• avoiding memory cycles for address computation
• optimization by storage scheme transformations
• optimization by memory architecture transformations
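One of these mechanisms - address calculation before run time - sketched in C (a hypothetical illustration): the scan pattern is compiled into an address table once, so the inner loop performs data accesses only, with no address arithmetic.

    #include <stdio.h>
    #include <stdint.h>

    enum { N = 64 };
    static uint16_t addr_table[N];   /* filled at "configuration time" */

    static void precompute_addresses(uint16_t base, uint16_t row_stride) {
        for (int i = 0; i < N; i++)  /* e.g. an 8x8 video scan */
            addr_table[i] = (uint16_t)(base + (i / 8) * row_stride + i % 8);
    }

    static uint32_t sum_stream(const uint32_t *mem) {
        uint32_t s = 0;
        for (int i = 0; i < N; i++)
            s += mem[addr_table[i]];  /* data access only */
        return s;
    }

    int main(void) {
        static uint32_t mem[0x200];
        precompute_addresses(0x100, 8);
        for (int i = 0; i < N; i++) mem[addr_table[i]] = 1;
        printf("%u\n", (unsigned)sum_stream(mem));   /* prints 64 */
        return 0;
    }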
Configware Compilation
Configware compilation is fundamentally different from software compilation. From the source "program" (C, FORTRAN, MATLAB), the configware compiler produces two outputs:
• a mapper (placement & routing) generates the configware code that configures the pipe network on the rDPA;
• a data scheduler, programming the data counters, generates the flowware code for the ASMs (auto-sequencing memories: GAG + RAM + data counter), which at run time emit the data streams into and out of the array.
Generic Sequence Examples
[Diagram: GAG scan patterns (GAG = address stepper, base slider, limit slider; parameters L0, DA, B0): a) atomic scan / linear scan, b) video scan, c) -90 deg rotated video scan, d) -45 deg rotated (mirrored-x) video scan, e) sheared video scan, f) non-rectangular video scan, g) zigzag video scan; also spiral scan, feed-back-driven scans, perfect shuffle.]
GAG Slider Model
[Diagram: a scan pattern example illustrating the slider model: a) total address, b) x-address sliders, c) y-address sliders, over x/y scan-line numbers 1-3 (addresses 1-9); GAG = generic address generator with address stepper, base slider and limit slider (L0, DA, B0).]
>> Outline <<
• The single-paradigm dictatorship
• Von Neumann vs. FPGA
• The Datastream Machine Model
• Avoiding address computation overhead
• The twin-paradigm approach
• Conclusions
Dual paradigm mind set: an old hat
(mapping from the procedural to the structural domain)
Software mind set, instruction-stream-based: flow chart -> control instructions.
Mapped into a hardware mind set: action box = flipflop (the "evoke" token bit travels through the FFs), decision box = (de)multiplexer.
1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc.
1972: C. G. Bell et al.: The Description and Use of Register-Transfer Modules (RTMs); IEEE Trans-C21/5, May 1972
Nick Tredennick's Paradigm Shifts explain the differences:
• Software Engineering (CPU): resources fixed, algorithm variable - 1 programming source needed (software).
• Configware Engineering: resources variable, algorithm variable - 2 programming sources needed (configware + flowware).
Our Contemporary Computer Machine Model

machine model    | resources: property / programming source       | sequencer: property / programming source
ASIC accelerator | hardwired / -                                  | hardwired / -
CPU              | hardwired / -                                  | programmable / Software (instruction streams)
RPU              | programmable / Configware (configuration code) | programmable / Flowware (data streams)

State registers: a program counter in the CPU; data counters of reconfigurable address generators in the ASM (auto-sequencing) data memory blocks.
The twin-paradigm dichotomy - with the same language primitives!
Compilation: Software vs. Configware
Software Engineering: source program -> software compiler -> software code.
Configware Engineering: source "program" (C, FORTRAN, MATLAB) -> configware compiler, consisting of a mapper (placement & routing) that produces configware code and a data scheduler that produces flowware code.
Co-Compilation
C, FORTRAN or MATLAB source -> automatic SW / CW partitioner -> Software / Configware Co-Compiler:
• the software compiler produces software code;
• the configware compiler (mapper + data scheduler) produces configware code and flowware code.
Co-Compiler for a Hardwired Anti Machine [e.g. Brodersen]
Source -> automatic SW / CW partitioner -> Software / Flowware Co-Compiler:
• the software compiler produces software code;
• the flowware compiler (data scheduler) produces flowware code.
A Heliocentric CS Model
The generalization of Software Engineering: Program Engineering (PE), comprising Software Engineering (SE) and Flowware* Engineering (FE) around the RPU, with the time-to-space mapping issue between them - a twin-paradigm dual-dichotomy approach.
*) do not confuse with "dataflow"!
Time to Space Mapping
The relativity dichotomy: a loop turns into a pipeline. (Cf. the machine model table: the CPU is programmed via Software - instruction streams, program counter; the RPU via Configware - configuration code - plus Flowware - data streams, data counters.)
"The biggest payoff will come from Putting Old ideas into Practice and teaching people how to apply them properly." [David Parnas, 1967]
Dual Paradigm Application Development
Juergen Becker's CoDe-X (1996), a software/configware co-compiler: from a high-level (C) language source, a partitioner feeds a SW compiler and a CW compiler. The software code runs instruction-stream-based on the CPU (with a hardwired accelerator); the configware code runs data-stream-based on the reconfigurable accelerator (an array of rDPUs).
>> Outline <<
• The single-paradigm dictatorship
• Von Neumann vs. FPGA
• The Datastream Machine Model
• Avoiding address computation overhead
• The twin-paradigm approach
• Conclusions
Ways to implement an Algorithm
• Hardware
• Software (RAM-based): the von Neumann machine - singlecore, multicore, manycore
• Configware + Flowware (RAM-based): the datastream machine - manycore per se
• mixed
Flowware
Flowware means parallelism resulting from time-to-space migration.
Flowware - scheduling data streams - comes from a generalization of the systolic array. It supports any wild free form of pipe networks: spiral, zigzag, fork and join, and even wilder; unidirectional and fully or partially bidirectional; FIFOs, stacks, registers, register files, RAM blocks ...
Software Education (R)evolution:
step by step, not overthrowing the SE scene, by simultaneous dual-domain co-education:
traditional qualification in the time domain
+ lean qualification in the space domain
= lean hardware modeling qualification at a higher level of abstraction
- a viable methodology for dual-rail education (only a few % of the curricula need to be changed).
RC versus Multicore ("RC" = Reconfigurable Computing)
RC: speed-up often higher by orders of magnitude - sure!
RC: energy-efficiency often higher - very much, or by orders of magnitude? Sure! This is the silver bullet.
Multicore: legacy software, control-intensive applications, etc.
We need both: multicore and RC.
We need new courses
We need undergraduate lab courses with HW / CW / SW partitioning. We need new courses with an extended scope on parallelism and algorithmic cleverness for HW / CW / SW co-design.
"We urgently need a Mead-&-Conway-like text book" [R. H., Dagstuhl Seminar 03301, Germany, 2003]
2007: here its forerunner - but not yet twin-paradigm.
SERUM-RC*
"We urgently need a Mead-&-Conway-style new mass movement community."
Software Education Revolution for using Multicore and RC* (SERUM-RC).
"We urgently need a Mead-&-Conway-dimension text book on twin-paradigm programming education."
*) RC = Reconfigurable Computing
thank you
END