Coarse Grain Reconfigurable Architectures


Swedish INTELECT
Summer School
on Multiprocessor
Systems on Chip
Örebro, Aug. 25-27, 2003
8.45 – 10.30 hrs
Reiner Hartenstein
Kaiserslautern
University of
Technology
Reconfigurable Computing and
its Compilation Techniques
Reconfigurable Computing:
a second programming domain
Migration of programming to the structural domain
The structural domain has become RAM-based
The opportunity to introduce
the structural domain to programmers ...
... to bridge the gap by clever abstraction mechanisms
using a simple new machine paradigm
© 2003, [email protected]
http://hartenstein.de
http://www.uni-kl.de
>> outline <<
• why coarse grain reconfigurable ?
• terminology
• toward higher abstraction levels
• flowware languages
• why a new Machine Paradigm ?
• (co-) compilation techniques
• final remarks
granularity
Datapath width
1 bit CLB:
fine grain
Word level CFB:
coarse grain
bundling of nibble or byte width CFBs: multiple granularity
One more argument for coarse grain
we have already seen the first day:
[Figure: MOPS / mW (0.001 to 1000, log scale) versus feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07 µ). Sources: T. Claasen et al.: ISSCC 1999; *) R. Hartenstein: ISIS 1997]
Wiring by abutment: a 32 Bit KressArray example
If coarse grain cells are full custom and mesh-connected, and the 2nd level interconnect resources are laid out over the cells, the array is almost as area-efficient as hardwired.
mapping algorithms efficiently onto rDPA
SNN filter on KressArray
array size: 10 x 16 = 160 rDPUs
Legend: rDPU not used; rDPU used for routing only (route-thru only); rDPU with operator and routing; port location marker; backbus connect
by the way: example of scalability / relocatability by EDA support
also FPGA scalability (avoid routing congestion) by EDA solution
Xplorer Plot: SNN Filter Example
http://kressarray.de
2 hor. NNports, 32 bit
3 vert. NNports, 32 bit
[Figure legend: rDPU with operator (+), operands and result; route-thru-only rDPU; backbus connect]
PACT XPP: Reference Module: XPU128 Co-Processor
XPP128 rDPA
[Figure: rDPA of PAEs (rDPUs), each with ALU, Ctrl and CFG around the PAE core; buses not shown]
• Full 32 or 24 Bit Design, working silicon
• 2 Configuration Hierarchies
• Evaluation Board
• XDS Development Tool with Simulator
• all used by SIEMENS Corporation
• Other contractors preparing .... : ask Ron Mabry (here in the audience)
Microprocessor architectures (8): highly parallel systems
[Figure: array of PEs arranged in rows between SRAM blocks, with I/O at the corners and a configuration manager]
© Arndt Bode, LRR-TUM; TU Dresden, 09.05.2003
XPP64A: Platform Development Board
- SDR Board In Debug Phase -> XPP64A Chips from STMicro Fab
- Assembly & Test / Available March 2003
PACT Corp
• Xtreme Processor Platform (XPP) family of IP cores: high-speed, data-stream-capable, scalable, reconfigurable clusters of arrays of 32-bit DPUs with embedded memories and high-speed I/O ports
• Application development support software featuring a flow-graph-style algorithm mapping language, to minimize training requirements
• XPP's fabrics feature automatic DataFlow synchronization and a flagged Event Network to dynamically configure the execution flow
• Supports dynamic RTR: hierarchical configuration managers free the designer from chip-level details and ensure that configurations are independently loaded in exactly the intended order
• Automatic event-based task swapping along with data streams: released resources are automatically reconfigured immediately
Development of Microprocessor Architectures (1)
Until 1995: narrowing; since 1995: growth in the variety of types and architectures
Transistor count (Moore's law): trade-off between computing performance, power consumption, cost and compatibility
MPR Analysts' Choice Awards categories:
- PC Processors: Intel P4 (HyperThreading), AMD Athlon (x86-64, HyperTransport), Transmeta (Binary Compilation, VLIW), ...
- Server Processors: Intel Xeon MP and Itanium 2 (EPIC), AMD Opteron (x86-64), HP Alpha EV-7, Fujitsu Sparc 64 V (out-of-order superscalar)
- High-Performance Embedded Processors: Broadcom BCM 1250, IBM 440 GX, Intrinsity FastMIPS, Motorola MPC 7455, NEC VR7701, PMC Sierra RM9000x2
- Low-Power Embedded Processors: AMD Au1100, Intel PXA 250, NEC VR 4131, DragonBall MX1, NeoMagic MiMagic5 (1 mW per MHz)
- Extreme Processors: CMU PipeRench, Intrinsity FastMath, Micron Yukon, NEC DRP, PACT XPP, Sandbridge Sand Blaster (up to 512 ALUs)
- Embedded IP Processor Cores: ARCtangent-A5, ARM 1026 EJ-S/1136JF-S, Improv Crescendo, MIPS M4K, Tensilica Xtensa V
- Graphics Processors: 3Dlabs Wildcat VP900, ATI Radeon 9700, Nvidia GeForce FX
© Arndt Bode, LRR-TUM
wide variety of speed-up factors
key issue: algorithmic cleverness
platform: PACT Xtreme 4-by-4 array [2003]; application: 16 tap FIR filter; speed-up factor: x16 MOPS/mW; method: straight forward
platform: MoM anti machine with DPLA* [1983]; application: grid-based DRC**, 1-metal 1-poly nMOS, 256 reference patterns; speed-up factor: > x1000 (computation time); method: multiple aspects
*) MPC fabrication via E.I.S. multi university project
**) Design Rule Check
instruction stream-based Compilation Principles
1-D memory space
source text
parser
library
link/load
instruction call placement
scheduler
execution order by location
Datastream-based Compilation Principles
library
mapper
placement
& routing
scheduler
data stream assembly
© 2003, PACT AG
Sequential Processor Model
Conventional processors use the sequential model:
Each operation takes one clock cycle.
Multiple operations are computed consecutively.
Register
Operation 1
Operation 2
Operation 3
Operation 4
Operation 5
Time
A New Parallel Processor Paradigm
Multiple computations are configured as code sections onto
a two dimensional array.
[Figure: two dimensional array (axes x, y) with Data Buffer, drawn along a Time axis]
Parallel Processor Model
Multiple code sections are computed sequentially.
[Figure: Section 1, Section 2 and Section 3 occupy the array (axes x, y) one after another along the Time axis]
Dataflow Performance
Traditional Microprocessor: Instruction Memory and cache; Register; ALU; one word; one operation per cycle; basic machine operations (ADD, MULT, SHIFT) performed on single words
XPP Architecture: Configuration Memory and cache; Buffer; Array of ALUs; stream of words; many operations per cycle; complex functions (Filter, FFT, Viterbi) performed on data streams
Dataflow Synchronisation: Transport Triggered
[Figure: data packet values at the array nodes]
XPP: Parallel Algorithm Example
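Matrix Multiplication
$\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} ax+by \\ cx+dy \end{pmatrix}$
Matrix is Constant
Flow Graph: four multipliers form ax, by, cx and dy; one adder yields x' = ax+by, the other yields y' = cx+dy
[PACT]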
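The flow graph of this example can be mimicked in software. The following is only a minimal sketch: the node names mul1 to mul4, adder1 and adder2 follow the labels on the mapping slides, but which constant sits in which multiplier, as well as the Python names `Matrix2x2` and `matvec_stream`, are assumptions made for illustration and are not part of the PACT XPP tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Matrix2x2:
    """Constant matrix, held in the constant registers of the multipliers."""
    a: float
    b: float
    c: float
    d: float


def matvec_stream(m, xs, ys):
    """Yield (x', y') for every pair of input packets, node by node.

    Each local variable stands for one array element of the flow graph;
    the assignment of constants to mul1..mul4 is an assumption.
    """
    for x, y in zip(xs, ys):      # in_x and in_y deliver one packet each
        mul1 = m.a * x
        mul2 = m.b * y
        mul3 = m.c * x
        mul4 = m.d * y
        adder1 = mul1 + mul2      # x' = ax + by
        adder2 = mul3 + mul4      # y' = cx + dy
        yield adder1, adder2      # out_x and out_y


if __name__ == "__main__":
    for packet in matvec_stream(Matrix2x2(1, 2, 3, 4), [1, 0, 5], [0, 1, 6]):
        print(packet)
```

On the array all six operators work concurrently on successive packets; the sequential loop above only reproduces the values, not the timing.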
XPP: Parallel Algorithm Example
[Figure: the flow graph next to the array with I/O, MUL and + elements, SCM and CM]
• SCM configures Opcodes and Constant Registers via CM
Note: MAC Opcode is not used in this example to improve clarity of the presentation
[PACT]
XPP: Parallel Algorithm Example
[Figure: flow graph mapped onto the array elements in_x, in_y, mul1 to mul4 (MUL, constants a, b, c, d), adder1 and adder2 (ADD), out_x, out_y; SCM + CM]
• CM Configures Opcodes and Constant Registers
[PACT]
XPP: Parallel Algorithm Example
[Figure: same array elements (in_x, in_y, mul1 to mul4, adder1, adder2, out_x, out_y), now with the connections between them; SCM + CM]
• CM Configures Routing Resources
[PACT]
XPP: Parallel Algorithm Example
[Figure: configured array; x and y enter through I/O at in_x and in_y, pass mul1 to mul4 and adder1, adder2, and leave as x' and y' through I/O at out_x and out_y]
• Data Packets are routed through the Network
[PACT]
>> terminology <<
• why coarse grain reconfigurable ?
• terminology
• toward higher abstraction levels
• flowware languages + mapping
• why a new Machine Paradigm ?
• (co-) compilation techniques
• final remarks
Tredennick’s Paradigm Shifts
[Timeline 1957, 1967, 1977, 1987, 1997, 2007: standard TTL; custom LSI, MSI; µproc., memory; ASICs, accel's]
hardwired: algorithm: fixed; resources: fixed
procedural programming: algorithm: variable; resources: fixed; vN machine paradigm
structural programming: algorithm: variable; resources: variable; 2 sources; new machine paradigm needed
Paradigm Shifts:
Nick Tredennick‘s view
why 2 program sources ?
instruction-stream-based computing: algorithms variable, resources fixed
reconfigurable computing: algorithms variable, resources variable
programmable
programming media and platforms
Co-Compilation
software: instruction stream from program Memory, running on the µProcessor (hardware)
flowware: data streams from data Memory, sequenced by data address generators (asM - auto sequencing Memory), running on the Reconfigurable Accelerators
configware: bit stream from configuration Memory, configuring the morphware
interface: between µProcessor and Reconfigurable Accelerators
Placement & routing (configware) done:
flowware defines ....
... which data item
at which time
at which port
[Figure: DPA with input data streams and output data streams, each plotted over time and port #; x marks a data item, - marks an idle slot, and the streams of neighbouring ports are shifted against each other in time]
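What flowware has to specify can be written down as a table from (time, port) to data item. The sketch below assumes the simplest case, a skew of one time step per port as in classical systolic arrays; the function names `skewed_schedule` and `render` are hypothetical and chosen only for this illustration.

```python
def skewed_schedule(streams):
    """Map (time, port) -> data item; port p starts p time steps late (assumed skew)."""
    schedule = {}
    for port, stream in enumerate(streams):
        for index, item in enumerate(stream):
            schedule[(index + port, port)] = item
    return schedule


def render(schedule, n_ports):
    """One text row per port: 'x' where an item is scheduled, '-' where the port idles."""
    last_time = max(time for time, _ in schedule)
    return [
        " ".join("x" if (time, port) in schedule else "-" for time in range(last_time + 1))
        for port in range(n_ports)
    ]


if __name__ == "__main__":
    streams = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]
    for row in render(skewed_schedule(streams), len(streams)):
        print(row)
```

The rows printed by `render` use the same x / - notation as the stream diagrams on the slide. A real scheduler would derive the skew from the pipeline depth of the placed operators rather than assume one step per port.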
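Terminology: Digital System Platforms clearly distinguished
platform: hardware; source running on it: (not running on it); machine paradigm: none
platform: morphware, fine grain: rGA (FPGA); source running on it: configware; machine paradigm: none
platform: morphware, coarse grain: rDPU, rDPA; source running on it: configware; machine paradigm: none
platform: reconfigurable data stream processor; source running on it: flowware & configware; machine paradigm: anti machine
platform: data stream processor (hardwired); source running on it: flowware; machine paradigm: anti machine
platform: instruction stream processor; source running on it: software; machine paradigm: von Neumann machine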
>> higher abstraction levels <<
• why coarse grain reconfigurable ?
• terminology
• toward higher abstraction levels
• flowware languages + mapping
• why a new Machine Paradigm ?
• (co-) compilation techniques
• final remarks
„EDA industry shifts into CS mentality“
[Wojciech Maly]
• patches instead of engineering
• innovation stalled many years ago
• netlist-based: do not care about efficiency, ...
• ... do not care about transistor density
• 85% users hate their tools
Development of Hypergrowth Markets
Harper Business 1995
Mainstream
Tornado
Paradigm
Shift
McKinsey Curve: dynamics of R&D disciplines
[Figure: S-curve of the maturity of a discipline over the years: fundamental issues, then consolidation, then saturation: limitations met; a new discipline is started on top of it by innovation]
EDA Industry Revolutions
EDA industry paradigm switching every 7 years
courtesy [Keutzer / Newton]
1978: Transistor entry: Applicon, Calma, CV ...
1985: Schematics entry: Daisy, Mentor, Valid ...
1992: Synthesis: Cadence, Synopsys ...
1999, 2006: HLLs, (Co-) Compilation; Data-Stream-based DPU arrays
coming closer to programmers' mind set
SoC System level Design:
Embedded SW (ESW) (ECW)
ESW becomes main vehicle to product differentiation
ECW
ESE becomes the main focus in system design:
CW- HW-(E)SW codesign onto highly programmable
platforms (SoC)
new design automation from high level descriptions
CW and SW synthesis included (SoC)
CW- HW-(E)SW-co-verification
H.]
formal verification for (E)SW and CW
Complexity: System Level Design Challenge
[ITRS 2001]
“abstraction levels must be raised above present-day RT-level
from HW + (processor-dependent embedded) C code level
language infrastructures for complex models (SystemC etc.)
must be leveraged by industry consensus
on use-methodology and abstraction levels”
>> flowware languages <<
• why coarse grain reconfigurable ?
• terminology
• toward higher abstraction levels
• flowware languages + mapping
• why a new Machine Paradigm ?
• (co-) compilation techniques
• final remarks
mathematical methods for systolic array synthesis
good reading: Nikolay Petkov: Systolic Parallel Processing; North-Holland; 1992
math formula, then preprocessing, then mapping (linear projection or algebraic method), yields the DPA architecture and its data streams
only uniform DPA with linear pipes: only for applications with strictly regular data dependencies
[Figure: DPA of DPUs with input data streams and output data streams, each plotted over time and port #]
Compilation for (r)DPA of anti machine
high level source program
(software notation)
parameters
wrapper
expression
morphware
tree
DPU library
configware
mapper
code
generators
scheduler
simulated
annealing
streamware
flowware
Super Pipe Networks
array: systolic array; applications: regular data dependencies only; pipeline properties: shape linear only, resources uniform only; mapping: linear projection or algebraic synthesis; scheduling (data stream formation): (e.g. force-directed) scheduling algorithm
array: supersystolic rDPA*; applications: no restrictions; pipeline properties: no restrictions; mapping: simulated annealing or P&R algorithm; scheduling (data stream formation): (e.g. force-directed) scheduling algorithm
*) KressArray [1995]
Programming Language Paradigms
language category: Computer Languages | Languages f. Anti Machine
both deterministic: procedural sequencing: traceable, checkpointable
operation sequence driven by:
  read next instruction | read next data item
  goto (instr. addr.) | goto (data addr.)
  jump (to instr. addr.) | jump (to data addr.)
  instr. loop, loop nesting | data loop, loop nesting
  no parallel loops, escapes | parallel loops, escapes
  instruction stream branching | data stream branching
state register: program counter | data counter(s)
address computation: massive memory cycle overhead | overhead avoided
Instruction fetch: memory cycle overhead | overhead avoided
parallel memory bank access: interleaving only | no restrictions
Basics of Binding Time
time of “Instruction Fetch”
run time
parallel computer
v.N. machine
Reconfigurable
Computing
anti machine
microprocessor
loading time
compile time
Similar Programming Language Paradigms
language category: Computer Languages | Xputer Languages
both deterministic: procedural sequencing: traceable, checkpointable
sequencing driven by:
  read next instruction | read next data object
  goto (instruction addr.) | goto (data addr.)
  jump (to instruction addr.) | jump (to data addr.)
  instruction loop | data loop
  instruction loop nesting | data loop nesting
  no parallel loops | parallel data loops
  instruction loop escapes | data loop escapes
  instruction stream branching | data stream branching
JPEG zigzag scan pattern
Flowware language example (MoPL)
[Figure: pixel map with x and y axes; the data counter traces HalfZigZag, then SouthWestScan, then the reversed, u-turned HalfZigZag]

*> Declarations
EastScan is
  step by [1,0]
end EastScan;
SouthScan is
  step by [0,1]
endSouthScan;
NorthEastScan is
  loop 8 times until [*,1]
    step by [1,-1]
  endloop
end NorthEastScan;
SouthWestScan is
  loop 8 times until [1,*]
    step by [-1,1]
  endloop
end SouthWestScan;
HalfZigZag is
  EastScan
  loop 3 times
    SouthWestScan
    SouthScan
    NorthEastScan
    EastScan
  endloop
end HalfZigZag;

goto PixMap[1,1]
HalfZigZag;
SouthWestScan
reverse(uturn(HalfZigZag))
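To make the role of the data counter concrete, the scan pattern can be replayed in a few lines. This is a sketch under stated assumptions, not a MoPL implementation: addresses are taken as [x,y] pairs running from 1 to 8, "loop 8 times until" is read as "at most 8 steps, stopping as soon as the wildcard address is reached", and reverse(uturn(HalfZigZag)) is modelled as the first half rotated by 180 degrees and walked backwards. All Python names are hypothetical.

```python
N = 8  # edge length of the pixel block (assumed 8 x 8, as in JPEG)


def scan(pos, step, max_steps=1, until=None):
    """Advance the data counter by `step`, at most `max_steps` times.

    Yields every new address and stops early once `until(address)` holds.
    """
    x, y = pos
    for _ in range(max_steps):
        x, y = x + step[0], y + step[1]
        yield (x, y)
        if until is not None and until((x, y)):
            break


def half_zigzag(start):
    """Addresses visited by HalfZigZag, not including the start address."""
    path = []

    def run(step, max_steps=1, until=None):
        origin = path[-1] if path else start
        path.extend(scan(origin, step, max_steps, until))

    run((1, 0))                                  # EastScan
    for _ in range(3):
        run((-1, 1), N, lambda p: p[0] == 1)     # SouthWestScan: until [1,*]
        run((0, 1))                              # SouthScan
        run((1, -1), N, lambda p: p[1] == 1)     # NorthEastScan: until [*,1]
        run((1, 0))                              # EastScan
    return path


def zigzag():
    """Full address sequence of the zigzag scan over the N x N block."""
    start = (1, 1)                                           # goto PixMap[1,1]
    first = [start] + half_zigzag(start)                     # HalfZigZag
    diagonal = list(scan(first[-1], (-1, 1), N, lambda p: p[0] == 1))  # SouthWestScan
    # reverse(uturn(HalfZigZag)), modelled as a point reflection walked backwards;
    # its first address equals the last address of the diagonal and is skipped.
    mirrored = [(N + 1 - x, N + 1 - y) for x, y in reversed(first)]
    return first + diagonal + mirrored[1:]


if __name__ == "__main__":
    print(zigzag()[:10])
```

The point of the exercise is that no instruction is fetched per pixel: once the scan pattern is configured, the address sequence is generated by the data counter alone.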
>> new Machine Paradigm <<
• why coarse grain reconfigurable ?
• terminology
• toward higher abstraction levels
• flowware languages + mapping
• why a new Machine Paradigm ?
• (co-) compilation techniques
• final remarks
CS: young ? dynamic?
.. but the von Neumann Paradigm is still the dominant doctrine ...
after >10 technology generations ...
1st, 2nd, 3rd, 4th, 5th, 6th, 7th, 8th, 9th, 10th, 11th, .......
4004, 8008, 8086, 80286, 80386, 80486, P5 (Pentium), P6 (Pentium Pro / Pentium II), Pentium III, ....
... still pushing the basic models from the times of mainframe dinosaurs
Microelectronics is ignored (except falling cost of computational effort)
... the vN Microprocessor is a Methuselah, the steam engine of the silicon age.
MPU designs more complex
new kinds of concurrency are becoming important
chip-level multiprocessing +
simultaneous multithreading
many bugs relate to concurrency issues
greatly complicates the verification process
[intel]
„Pollack‘s Law“ (simplified)
[Figure: growth factor of area efficiency and of performance versus feature size (µm)]
KressArray principles
• take systolic array principles
• replace classical synthesis by simulated annealing
• yields the super systolic array
• a generalization of the systolic array
• no more restricted to regular data dependencies
• now reconfigurability makes sense
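The step "replace classical synthesis by simulated annealing" can be illustrated with a toy placer. The sketch below is not the KressArray mapper: it assumes a plain rectangular mesh, uses total Manhattan wire length as a hypothetical cost function, and applies a simple linear cooling schedule. It only shows why irregular data dependencies are no obstacle for this kind of mapping, since any edge list is accepted.

```python
import math
import random


def wire_length(placement, edges):
    """Total Manhattan distance over all edges (hypothetical cost function)."""
    return sum(
        abs(placement[u][0] - placement[v][0]) + abs(placement[u][1] - placement[v][1])
        for u, v in edges
    )


def anneal_placement(nodes, edges, rows, cols, steps=4000, t0=2.0, seed=0):
    """Place each node on its own mesh cell; return (best placement, initial cost, best cost)."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    if len(nodes) > len(cells):
        raise ValueError("array too small for this graph")
    rng.shuffle(cells)
    placement = dict(zip(nodes, cells))
    free = cells[len(nodes):]                       # cells without an operator
    cost = initial = wire_length(placement, edges)
    best, best_cost = dict(placement), cost
    for step in range(steps):
        temperature = t0 * (1.0 - step / steps) + 1e-9   # linear cooling
        u = rng.choice(nodes)
        if free and rng.random() < 0.5:
            k, v = rng.randrange(len(free)), None        # move u to an unused cell
            placement[u], free[k] = free[k], placement[u]
        else:
            k, v = None, rng.choice(nodes)               # swap two operators
            placement[u], placement[v] = placement[v], placement[u]
        new_cost = wire_length(placement, edges)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temperature):
            cost = new_cost                              # accept the change
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        elif k is not None:
            placement[u], free[k] = free[k], placement[u]            # undo the move
        else:
            placement[u], placement[v] = placement[v], placement[u]  # undo the swap
    return best, initial, best_cost


if __name__ == "__main__":
    nodes = ["in_x", "in_y", "mul1", "mul2", "mul3", "mul4", "adder1", "adder2"]
    edges = [("in_x", "mul1"), ("in_y", "mul2"), ("in_x", "mul3"), ("in_y", "mul4"),
             ("mul1", "adder1"), ("mul2", "adder1"), ("mul3", "adder2"), ("mul4", "adder2")]
    placement, before, after = anneal_placement(nodes, edges, rows=3, cols=4)
    print(before, after, placement)
```

A production mapper would also model routing resources, port counts and route-through cells, which this sketch deliberately leaves out.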
control-procedural vs. data-procedural
The structural domain is primarily data-stream-based:
Flowware
..... mostly not yet modelled that way:
most flowware is hidden by its indirect
instruction-stream-based implementation
Flowware provides a (data-)procedural abstraction
from the (data-stream-based) structural domain
Flowware converts „procedural vs. structural“
into „control-procedural vs. data-procedural“ ...
... a Trojan horse to introduce the structural domain
to the procedural mind set of programmers
Why a dichotomy of machine paradigms?
vN: unbalanced
vN bottleneck
data stream machine:
• bad message:
caches do not help
• good message:
no vN bottleneck
• caches not needed
stolen from Bob Colwell
The anti machine has no
von Neumann bottleneck
computing paradigms and methodologies
1946: machine paradigm (von Neumann)
1989: anti machine paradigm
1990: rDPU (Rabaey)
1994: anti machine high level programming language
1995: super systolic rDPA
flowware*
1980: data streams (Kung, Leiserson)
1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...
1997+: discipline of distributed memory architecture
1997: configware / software partitioning compiler
Flowware heading toward mainstream
•Data-stream-based Computing is heading for mainstream
–1997 SCCC (LANL) Streams-C Configurable Computing
–SCORE (UCB) Stream Computations Organized for Reconfigurable Execution
–ASPRC (UCB) Adapting Software Pipelining for Reconfigurable Computing
–2000 Bee (UCB), ...
–Most stream-based multimedia systems, etc.
–Many other areas ....
Flowware:
managing data streams
Software:
managing instruction streams
-
Matter & Antimatter: Atom and Anti Atom
Anti Matter
+
Machine paradigm:
Anti Atom
The World of Matter
Machine paradigm:
the Atom
© 2003, [email protected]
56
+
http://hartenstein.de
Kaiserslautern
University of
Technology
Matter & Antimatter of Informatics :
Anti Machine paradigm
CPU
-
+
nothing central !
DPU
+
© 2003, [email protected]
57
-
http://hartenstein.de
Kaiserslautern
University of
Technology
machine paradigm: some differences
no. of streams ≥ 1
CPU
-
+
+
DPA
+
DPU
+
-
© 2003, [email protected]
58
-
+
http://hartenstein.de
Parallelism by Concurrency
Kaiserslautern
University of
Technology
independent instruction streams
+
+
-
-
© 2003, [email protected]
+
+
-
+
59
-
+
-
+
http://hartenstein.de
Dead Supercomputer Society
[Gordon Bell, keynote at ISCA 2000]
•ACRI
•Alliant
•American Supercomputer
•Ametek
•Applied Dynamics
•Astronautics
•BBN
•CDC
•Convex
•Cray Computer
•Cray Research
•Culler-Harris
•Culler Scientific
•Cydrome
•Dana/Ardent/Stellar/Stardent
•DAPP
•Denelcor
•Elexsi
•ETA Systems
•Evans and Sutherland Computer
•Floating Point Systems
•Galaxy YH-1
•Goodyear Aerospace MPP
•Gould NPL
•Guiltech
•ICL
•Intel Scientific Computers
•International Parallel Machines
•Kendall Square Research
•Key Computer Laboratories
•MasPar
•Meiko
•Multiflow
•Myrias
•Numerix
•Prisma
•Tera
•Thinking Machines
•Saxpy
•Scientific Computer Systems (SCS)
•Soviet Supercomputers
•Supertek
•Supercomputer Systems
•Suprenum
•Vitesse Electronics
Lacking Sense of Direction ?
„we are o.k. !“ (no new direction)
blinders:
for ignoring the impact of RC
Some Supercomputing people
now looking at us
Steroids for the
aging microprocessor:
Reconfigurable
Computing
Machine paradigms
Kaiserslautern
University of
Technology
von Neumann
memory
M
I/O
instruction
stream
machine
instruction
stream
DPU
CPU instruction
sequencer
-
CPU
+
(reconf.) data-stream machine
Flowware
DPU
+
-
Software (Configware)
M M M M
I/O
M
I/O
memory
data address
generator
(data sequencer)
asM**
data stream
DPU or rDPU
distributed memory architecture*
memory
M
M M M M
M
I/O
(r)DPU
© 2003, [email protected]
(r)DPA
*) the new discipline came just in time: see Herz et al.: Proc. IEEE ICECS 2002
heavy anti atoms: DPA = DPU array
+
+
+
DPU
DPU
DPU
DPU
DPU
DPU
DPU
DPU
DPU
DPA
-
+
-
+
-
+
© 2003, [email protected]
-
-
-
64
+
-
DPA
+
-
+
http://hartenstein.de
Distributed Memory
SA: scrambling and descrambling the data ?
Just in time: a new research area:
Application-specific distributed memory:
e. g. book by F. Catthoor et al. ...
Data address generators - 20 years research:
>> compilation techniques <<
• why coarse grain reconfigurable ?
• terminology
• toward higher abstraction levels
• flowware languages + mapping
• why a new Machine Paradigm ?
• (co-) compilation techniques
• final remarks
We introduce: Co-Compilation
Co-Compilation
Machine
Paradigm
partitioning compiler
Computer
µProcessor
interface
Software
running on
high level programming
language source
Reconfigurable
Accelerators
Configware
running on
Xputer
“Soft”
Machine
Paradigm
Reconfigurable
Architecture (RA)
-- instead of hardwired
The Secret of Success: Co-Compilation
supporting platform-based design
High level PL source
“vN" machine
paradigm
Partitioner
anti machine
paradigm
CW
SW
Analyzer
compiler / Profiler compiler
SW code
© 2003, [email protected]
CW Code
68
could provide
the platforms
supporting
different
platforms
Resource
Parameters
http://hartenstein.de
Loop Transformation Examples
resource parameter driven Co-Compilation
sequential processes: loop 1-16 body endloop
strip mining: fork; loop 1-8 body endloop in parallel with loop 9-16 body endloop; join
loop unrolling: loop 1-8 body body endloop
[Figure: two columns, host and reconf. array; on the host side the loop shrinks to loop 1-8 trigger endloop, loop 1-4 trigger endloop or loop 1-2 trigger endloop, while the reconf. array executes the transformed loop bodies]
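Both transformations leave the computed results unchanged and only regroup the iterations. The following sketch checks this for the 1-16 loop of the slide; the function names and the `strips` and `factor` parameters are hypothetical stand-ins for the resource parameters that drive the co-compiler, and the strips run one after another here, whereas on the array they would run side by side.

```python
def sequential(body, lo=1, hi=16):
    """Reference: loop lo..hi, one body per iteration."""
    return [body(i) for i in range(lo, hi + 1)]


def strip_mined(body, lo=1, hi=16, strips=2):
    """Split the iteration space into `strips` contiguous strips (fork ... join)."""
    count = hi - lo + 1
    size = -(-count // strips)                   # ceiling division
    results = []
    for s in range(strips):                      # each strip could get its own array region
        start = lo + s * size
        stop = min(start + size - 1, hi)
        results.extend(body(i) for i in range(start, stop + 1))
    return results


def unrolled(body, lo=1, hi=16, factor=2):
    """Execute `factor` copies of the body per loop trip."""
    results = []
    i = lo
    while i + factor - 1 <= hi:
        for k in range(factor):                  # these copies sit side by side on the array
            results.append(body(i + k))
        i += factor
    while i <= hi:                               # leftover iterations
        results.append(body(i))
        i += 1
    return results


if __name__ == "__main__":
    square = lambda i: i * i
    print(sequential(square) == strip_mined(square) == unrolled(square))
```

With `strips=2` the iteration space splits into 1-8 and 9-16 as on the slide; with `factor=2` the loop makes 8 trips of two bodies each.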
Machine Paradigms
machine category: Computer (the Machine: “v. Neumann”) | The Anti Machine
driven by: Instruction streams | data streams (no “dataflow”)
engine principles: instruction sequencing | sequencing data streams
state register: single program counter | (multiple) data counter(s)
Communication path set-up (“instruction fetch”): at run time | at load time
data path resource: DPU (e.g. single ALU) | DPU or DPA (DPU array) etc.
data path operation: sequential | parallel pipe network etc.
The Anti Machine: also hardwired implementations*
*) e.g. Bee project, Prof. Brodersen
KressArray Family generic Fabrics:
a
few
examples
Select mode,
Select
number, width
of NNports
16
Function
Repertory
8
32
+
24
2
rDPU
4
select Nearest Neighbour (NN) Interconnect: an example
route-through only
more NNports: rich routing resources
route-through and function
Examples of 2nd Level Interconnect: laid out over the rDPU cell, no separate routing areas !
http://kressarray.de
KressArray DPSS
ALEX
Code
Architecture
Estimator
User
User
Interface
interm.
form
Selection
Architecture
Editor
Mapping
Editor
Data
Path
Synthesis
System
© 2003, [email protected]
ALE-X
Compiler
interm.
form
Bus
& I/O
Mapper
Schedule
interm.
form
HDL
Generator
Simulator
VHDL
Verilog
Design
Rules
Datapath
Generator
Generator
Scheduler
Kress
rDPU
Layout
DPSS
Power
Estimator
Power
Data
72
http://hartenstein.de
Application
Set
User
KressArray
(Design Space)
(Platform Space)
Xplorer
User
Interface
ALEX
Code
ALE-X
Compiler
Suggestion
KressArray DPSS Xplorer
Architecture
Estimator
interm.
form
interm.
form
Selection
Architecture
Editor
Mapping
Editor
interm.
form
Bus
& I/O
Mapper
Schedule
Improvement
Proposal
Generator
© 2003, [email protected]
Suggestion
statist.
Data
Delay
Estim.
Inference
Engine (FOX)
Scheduler
DPSS
Analyzer
73
http://hartenstein.de
Ulrich Nageldinger‘s Ph. D. thesis
http://hartenstein.de
click „recent talks“
this page: also link to Ph. D thesis download
>> final remarks <<
• why coarse grain reconfigurable ?
• terminology
• toward higher abstraction levels
• flowware languages + mapping
• why a new Machine Paradigm ?
• (co-) compilation techniques
• final remarks
Where are we heading ?
[Figure: factor (0 to 2) versus months (10, 12, 18)]
90% by 2010
10 times more programmers will write embedded applications than computer software by 2010*
*) Department of Trade and Industry, London
PS: Personal Supercomputer replaces the PC
[Timeline 1957, 1967, 1977, 1987, 1997, 2007: mainframes, PC, PS: personal supercomputer]
co-compiler
µProc, rDPA
data streams ...
morphware
What‘s the problem ?
µprocessor
accelerators
Crossing the Hardware /
Software Chasm [Mike Butts]
It‘s the gap between procedural and structural mind set
Traditional CS: programming is (control-)procedural,
instruction-stream-based – sources: software
The typical programmer has trouble understanding
function evaluation without machine mechanisms,
i.e. by signals rippling through a network of transistors.
What‘s the problem ?
µprocessor
Crossing the Hardware /
Software Chasm [Mike Butts]
accelerators
structural
hemisphere
missing
The brain hurts on paradigm shift ?
no, it can‘t ...
solution only with user-friendly
SW / CW / FW co-compilers
based on anti machine paradigm
used as a Trojan Horse into CS
Brain usage:
procedural-only
http://hartenstein.de
Annihilation?
Kaiserslautern
University of
Technology
-
avoidable
by tools ....
+
© 2003, [email protected]
+
80
http://hartenstein.de
>>> thank you <<<<<
thank you
for your
patience
>>> END <<<
Conclusion: all knowledge needed is available
• machine paradigm
• languages
• hw / sw partitioning methodology
• compilation techniques
• anti machine architectural resources
• sequencing methodology: hw & sw
• parallel memory IP core and module generator vendors
• anything else needed
The Situation in Computing Sciences
• Computing Sciences are in a severe crisis
• New fundamentals and R&D directions are inevitable
• my mission: getting you involved
• All knowledge needed is readily available ...
• ... even from Computing Sciences
• Silicon application and EDA provide useful concepts
• Reconfigurable Computing has the remedy
Configware / Flowware Compilation
M
M
M
high level source program
M
data
streams
M
M
M
M
© 2003, [email protected]
M
M
mapper
configware
M
M
M
r. Data
Path
Array
M
wrapper
intermediate
M
rDPA
M
asM
scheduler
address
generator
85
flowware
data sequencer
http://hartenstein.de
“von Neumann” Computer:
the wrong Machine Paradigm
Xputer Lab, University of Kaiserslautern
Computer (“von Neumann”): RAM; Compiler; instructions; Sequencer with program counter; Datapath: hardwired; tightly coupled by compact instruction code; does not support soft data paths
Xputer (anti machine), The Soft Machine Paradigm: RAM; Compiler and Scheduler; “instructions”; (multiple) sequencer with data counters as state register; Datapath Array: reconfigurable, also for hardwired; loosely coupled by decision data bits only
Why Coarse Grain instead of FPGA ?
Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld
[Figure: Transistors / chip (1000 to 100 000 000 000, log scale) versus year (1980, 1990, 2000, 2010); curves: physical, logical, FPGA physical, FPGA logical, FPGA routed; marked gaps of ~ 10 and ~ 10 000]
reconfigurability overhead reduced by up to ~ 1000
drastically smaller configuration memory
much faster loading
a lot more benefits