Coarse Grain Reconfigurable Architectures

IPDPS 2004
Santa Fe, NM, April 26 - 30, 2004
Reiner Hartenstein
TU Kaiserslautern
Software or Configware?
About the Digital Divide
of Parallel Computing
Preface
The White House, Sept 2000:
Bill Clinton condemns the Digital Divide in America:
access to the internet
World Economic Forum 2002:
The Global Digital Divide,
disparity between the "haves" and "have nots"
The Digital Divide of Parallel Computing:
Access to Configware (CW) Solutions
© 2004, [email protected]
http://hartenstein.de
The „havenots“
Configware methodology to
move data around more efficiently:
„havenots“ are found in the HPC community
Configware engineering as a qualification
for programming embedded systems*:
The „havenots“ are our typical CS graduates
Reconfigurable HPC is torpedoed
by deficits in education:
curricular revisions are overdue
*) also HPC!
Software to Configware Migration
Software to Configware Migration
is the most important source of speed-up
Hardware is just frozen Configware
this talk will illustrate the performance benefit
which may be obtained from Reconfigurable Computing
stressing the coarse grain Reconfigurable Computing (RC)
point of view, this talk hardly mentions FPGAs
(but coarse grain may always be mapped onto FPGAs)
>> HPC <<
• HPC
• Embedded Computing
• The wrong Roadmap
• Configware Engineering
• Dual Machine Paradigms
• Speed-up Examples
• Final Remarks
http://www.uni-kl.de
moving data around inside the Earth Simulator
Crossbar weight: 220 t, 3000 km of cable,
ES 20: TFLOPS
5120 Processors, 5000 pins each
data are moved around by software
i.e. by memory-cycle-hungry instruction
streams which fully hit the memory wall
(slower than CPU clock by 2 orders of magnitude)
extremely unbalanced
stolen from Bob Colwell
path of least resistance*:
avoiding a paradigm shift
*) [Michel Dubois]
Many researchers seem never to stop working on
sophisticated solutions for marginal improvements ...
... continuously ignoring methodologies promising
speed-ups by orders of magnitude ...
... continue to bang their heads
against the memory wall
blinders to ignore the impact of morphware
instead of
… understand only this parallelism solution:
the instruction-stream-based approach
(von Neumann bottlenecks)
the data-stream-based approach
has no von Neumann bottleneck
>> Embedded Computing <<
History of Machine Models
mainframe age: compile -> mainframe
computer age (PC age): compile -> µProc. + accel.
procedural mind set: instruction-stream-based
scientific computing example (molecular dynamics, astrophysics, plasma physics, hydrodynamics):
MD-GRAPE-2 PCI board [1997]
4 chips for N-body simulation
converts a PC to 64 GFlops
users: RIKEN institute, ARI, Heidelberg, etc.
(time axis 1957 to 2007, coordinates by Makimoto's wave)
History of Machine Models
mainframe age: compile -> mainframe
computer age (PC age): compile -> µProc.; design by hardware guys -> accel.
procedural mind set: instruction-stream-based
structural mind set: data-stream-based
(time axis 1957 to 2007, coordinates by Makimoto's wave)
the hardware / Software Chasm:
µprocessor
accelerators
It's the gap between the procedural (instruction-stream-based) and the structural (data-stream-based) mind set
typical programmers don't understand function evaluation
without machine mechanisms (counters, state registers)
Growth Rate of Embedded Software
already today, more than 98%
of all microprocessors
are used within embedded systems
>10 times more
programmers will write
embedded applications than
computer software by 2010
[Figure: growth factor (0 to 2) plotted over months (10, 12, 18)]
*) Department of Trade and Industry, London
typical CS graduates: the „havenots“
To-day, „typical“ CS graduates are
unqualified for this labor market
… cannot cope with Hardware / Configware /
Software partitioning issues
… cannot implement Configware
the current CS mind set is based
on the Submarine Model
This model does not support
Hardware / Configware /
Software partitioning
Algorithm
procedural high level
Programming Language
Assembly Language
Hardware invisible:
under the surface
Hardware
Hardware / Configware / Software Partitioning
skills urgently needed
Software to Configware Migration is the most
important source of speed-up
Hardware is just frozen Configware
to cope with each of it: SW, CW, HW
or: to cope with any combination of co-design: SW/HW, SW/CW/HW
[Figure: partitioning of an Algorithm into SW, CW and HW]
By the way ...
... the oldest and largest conference in the field:
International Conference on
Field-Programmable Logic
and Applications (FPL)
http://fpl.org
Aug. 30 – Sept 1, 2004, Antwerp, Belgium
... going into every type of application
288 submissions!
they all work on high performance
Dominance of the Submarine Model ...
... indicates that our CS education
system produces zillions of mentally
disabled CS graduates
… unable to cope with
solutions other than
instruction-stream-based ones
[Figure: Submarine Model; labels: (procedural), structurally disabled, Hardware]
CS Education
You cannot* teach Hardware
to a Programmer
But to a Hardware Guy
you always can
teach Programming
*) efficiently
[Figure labels: have / structural / natural; have not / procedural]
>> the wrong Roadmap <<
Completely wrong roadmap
beef up old architectural
principles by new technology?
... the CPU is a Methuselah,
the steam engine
of the silicon age
[Figure: „Pollack's Law“ (simplified) [intel]: growth factor of performance and area efficiency over feature size in µm]
Completely wrong mind set
The key problem, the memory wall,
cannot be solved by new CPU technology
The vN paradigm is not a communication paradigm
Its monopoly creates a completely wrong mind set
We need a 2nd machine paradigm (a 2nd mind set ...)
We need an architectural communication paradigm
But we need both paradigms: a dichotomy
3rd machine model became mainstream
mainframe age: compile -> mainframe (instruction-stream-based)
computer age (PC age): compile -> µProc.; design -> accel.
morphware age: µProc. + rDPA
(time axis 1957 to 2007, Makimoto's wave)
>> Configware Engineering <<
de facto Duality of RAM-based platforms
We now have 2 types of programmable platforms:
                      traditional                   new
RAM-based platform:   CPU                           morphware (FPGA, rDPA ..)
„running“ on it:      software                      configware
machine paradigm:     von Neumann etc.:             2nd paradigm, anti machine:
                      instruction-stream-based      data-stream-based
morphware [DARPA]: soft hardware
hardware viewed as frozen configware: just earlier binding
....
the brain hurts
The HPC scene believed itself to be smart
when smiling at us CW guys
[Gordon Bell]
Others experienced that the brain hurts
when trying the paradigm shift
morphware: fastest
growing sector
of the IC market
CW has become
mainstream ...
... going into every
type of application
[Gordon Bell]
From Software to Configware Industry
Software Industry's Secret of Success:
procedural personalization via
1) RAM-based
2) Machine Paradigm
Growing Configware Industry:
Repeat Success Story by a 2nd Machine Paradigm!
structural personalization: RAM-based, anti machine
[Figure: time axis 1957 to 2007; computer age (PC age): compile, µProc.; morphware age: rDPA]
benefit from RAM-based & 2nd paradigm
1) RAM-based platform needed for:
• flexibility, programmability
• avoiding the need of specific silicon
(mask cost: currently 2 mio $, rapidly growing)
2) simple 2nd machine paradigm needed as a common model:
• to avoid the need of circuit expertise
• needed to educate zillions of programmers
By the way: relocatability is more difficult, but not hopeless
(vN relocatability is based on the von Neumann bottleneck: high price)
Nick Tredennick’s Paradigm Shifts
explain the differences
Software Engineering (CPU):
resources: fixed
algorithm: variable
1 programming source needed: software
Configware Engineering:
resources: variable
algorithm: variable
2 programming sources needed: configware, flowware
Compilation: Software vs. Configware
Software Engineering:
source program -> software compiler -> software code
Configware Engineering:
source „program“ -> configware compiler -> configware code + flowware code
(configware compiler: mapper for placement & routing, plus data scheduler)
Flowware programs
data streams
Flowware defines:
... which data item
at which time
at which port
[Figure: DPA with input data streams and output data streams, drawn as port # over time; x marks a data item, - marks an empty slot]
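The flowware definition above (which data item, at which time, at which port) can be written down as a small schedule. This is only a sketch under assumptions not taken from the slides: the three streams, the item names and the one-step skew per port are hypothetical, chosen to resemble the staggered x/- pattern of the figure.

```python
from typing import Dict, List, Tuple

# (time step, port number) -> data item
Schedule = Dict[Tuple[int, int], str]


def skewed_schedule(streams: List[List[str]]) -> Schedule:
    """Assign each item a time step and a port; port p starts p steps late."""
    schedule: Schedule = {}
    for port, items in enumerate(streams):
        for index, item in enumerate(items):
            schedule[(index + port, port)] = item
    return schedule


def render(schedule: Schedule, n_ports: int) -> List[str]:
    """One text row per time step; '-' marks a port with no data item."""
    n_steps = max(time for time, _ in schedule) + 1
    return [
        " ".join(schedule.get((time, port), "-") for port in range(n_ports))
        for time in range(n_steps)
    ]


# Hypothetical input: three streams of three items each.
streams = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]
schedule = skewed_schedule(streams)
rows = render(schedule, n_ports=len(streams))
for row in rows:
    print(row)
```

Each printed row is one time step and each column one port, so the schedule answers exactly the three questions the slide lists.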
Flowware: not new
Flowware: data stream* ... around 1980
*) no confusion, please: no „dataflow machine“ !!!
[Figure: time axis 1957 to 2007 (Makimoto's wave): mainframe age, computer age (PC age), morphware age]
data streams*: not new
1980: data streams (Kung, Leiserson: systolic arrays)
1989: data-stream-based Xputer architecture
1990: rDPU (Rabaey)
1994: Flowware Language MoPL (Becker et al.)
1995: super systolic array (rDPA) + DPSS tool (Kress)
1996+: Stream-C language, SCCC (Los Alamos),
SCORE, ASPRC, Bee (UC Berkeley), ...
1996+: streaming languages (Stanford et al.)
1996: configware / software partitioning compiler (Becker)
>> Dual Machine Paradigms <<
Why a new machine paradigm ???
[Figure labels: µprocessor, rDPA]
The anti machine as the 2nd paradigm
is the key to curricular innovation
... a Trojan horse to introduce the structural domain
to the procedural-only mind set of programmers
Programming by flowware instead of software
is very easy to learn (... same language primitives)
Flowware education: no fully fledged hardware
expert needed to program embedded systems
von Neumann vs. anti machine
[Figure: instruction stream machine (von Neumann etc.): CPU (DPU, program counter), RAM memory, von Neumann bottleneck. Data stream machine (anti machine): (r)DPA without sequencer, asM memory banks with data counters]
asM: auto-sequencing Memory
asMA: auto-sequencing Memory Array
Behavior of the Counter
[Figure: program counter in the CPU (with DPU) vs. data counters in the asM memory banks around the (r)DPA]
Counters: the same micro architecture ?
instruction stream machine (von Neumann etc.): CPU with DPU and program counter
data stream machine (anti machine): asM with memory bank and data counter
AGU: address generator unit
yes, it is possible, but for data counters ...
... a much better AGU methodology is available*
*) for history of AGUs see Herz et al.: Proc. ICECS 2002, Dubrovnik, Croatia
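To make the data counter idea concrete, here is a minimal sketch of an auto-sequencing memory driven by an address generator. It is an illustration only, not the AGU methodology referred to above: the function names, the row-major 10 x 10 memory and the 2 x 3 window scan are assumptions made for this example.

```python
from typing import Iterable, Iterator, List


def agu_scan(base: int, row_stride: int, rows: int, cols: int) -> Iterator[int]:
    """Hypothetical address generator: addresses of a rows x cols window
    inside a row-major memory, produced without any instruction fetch."""
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c


def auto_sequencing_memory(memory: List[int], addresses: Iterable[int]) -> Iterator[int]:
    """Model of an asM: the data counter (address stream) selects each word."""
    for address in addresses:
        yield memory[address]


def dpu_accumulate(stream: Iterable[int]) -> int:
    """A trivial data path unit: sums whatever data stream it is fed."""
    total = 0
    for word in stream:
        total += word
    return total


memory = list(range(100))  # hypothetical 10 x 10 array, stored row-major
addresses = list(agu_scan(base=22, row_stride=10, rows=2, cols=3))
window_sum = dpu_accumulate(auto_sequencing_memory(memory, addresses))
print(addresses, window_sum)
```

The point of the sketch is the division of labour: the address sequence is fixed before run time, so the data path only ever sees a stream of operands.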
commercial rDPA example:
PACT XPP - XPU128
XPP128 rDPA
• Full 32 or 24 Bit Design, working silicon
• 2 Configuration Hierarchies
• Evaluation Board available, and
• XDS Development Tool with Simulator
[Figure: XPP128 rDPA; labels: rDPU, PAE core, ALU, Ctrl, CFG; buses not shown]
© PACT AG, Munich http://pactcorp.com
mapping algorithms efficiently onto rDPA
by DPSS: based on simulated annealing
SNN filter on KressArray
array size: 10 x 16 = 160 rDPUs
[Ulrich Nageldinger]
Legend: rDPU not used; rDPU used for routing only (rout thru only); operator and routing; port location; backbus connect; not used marker
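The slide names simulated annealing as the basis of the DPSS mapper but gives no details. The following is a generic annealing placement sketch, not the DPSS algorithm: the cost function (total Manhattan wire length), the swap move, the cooling schedule, the 3 x 3 grid and the six-operator chain are all assumptions made for illustration.

```python
import math
import random
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def wire_length(placement: Dict[str, Cell], edges: List[Tuple[str, str]]) -> int:
    """Total Manhattan distance over all connected operator pairs."""
    return sum(
        abs(placement[u][0] - placement[v][0]) + abs(placement[u][1] - placement[v][1])
        for u, v in edges
    )


def anneal(nodes, edges, rows, cols, steps=3000, t_start=5.0, t_end=0.05, seed=0):
    """Place nodes on a rows x cols grid; returns (best placement, best cost, initial cost)."""
    if len(nodes) > rows * cols:
        raise ValueError("more operators than grid cells")
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(cells)
    occupant = {cell: None for cell in cells}
    for node, cell in zip(nodes, cells):
        occupant[cell] = node

    def placement() -> Dict[str, Cell]:
        return {node: cell for cell, node in occupant.items() if node is not None}

    cost = wire_length(placement(), edges)
    initial_cost = cost
    best, best_cost = placement(), cost
    for step in range(steps):
        # Geometric cooling from t_start towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        a, b = rng.sample(cells, 2)
        occupant[a], occupant[b] = occupant[b], occupant[a]  # trial swap
        new_cost = wire_length(placement(), edges)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = placement(), cost
        else:
            occupant[a], occupant[b] = occupant[b], occupant[a]  # undo swap
    return best, best_cost, initial_cost


# Hypothetical data path: a chain of six operators.
nodes = ["op0", "op1", "op2", "op3", "op4", "op5"]
edges = [(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]
best, best_cost, initial_cost = anneal(nodes, edges, rows=3, cols=3)
print(initial_cost, best_cost, best)
```

Because the best placement seen so far is tracked separately, the returned cost can never be worse than the random starting placement, even though the search itself accepts some uphill moves.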
symbiosis of machine models
mainframe age: compile -> mainframe
computer age (PC age): compile -> µProc.; design -> accel.
morphware age: co-compiler -> µProc. + rDPA (symbiosis)
replace PC by PS
(time axis 1957 to 2007, Makimoto's wave)
Software / Configware Co-Compilation
Jürgen Becker’s CoDe-X, 1996
High level PL source -> Partitioner (with Analyzer / Profiler)
„vN“ machine paradigm: SW compiler -> SW code
anti machine paradigm: CW compiler -> CW code
Resource Parameters: supporting different platforms
>> Speed-up Examples <<
Better solutions by Configware
instead of software
methodologies not new: high level synthesis (1980+)
loop transformations (1970+)
many other areas
Memory cycles minimized
e.g.: no instruction fetch at run time & other effects
No cache misses!
Memory access for data: caches do not help anyhow
Loop xforms: no intra-stream data memory cycles
Complex address computation: no memory cycles
speed-up examples
key issue: algorithmic cleverness
platform | application example | speed-up factor | method
PACT Xtreme 4-by-4 array [2003] | 16 tap FIR filter | x16 MOPS/mW | straight forward
MoM anti machine with DPLA* [1983] | grid-based DRC**, 1-metal 1-poly nMOS***, 256 reference patterns | > x1000 (computation time) | multiple aspects
CPU 2 FPGA [FPGA 2004] | migrate several simple application examples | x7 – x46 (compute time) | hi level synthesis
DSP 2 FPGA [Xilinx 2004 2)] | from fastest DSP: 10 gMACs to 1 teraMAC | x100 (compute time) | not spec.
*) DPLA: MPC fabr. via E.I.S. multi univ. project
**) Design Rule Check
***) for 10-metal 3-poly cMOS expected: >> x10,000
2) Wim Roelandts
Software to Configware Migration:
(RAW’99 at Orlando)
Ulrich Nageldinger‘s talk
about KressArray Xplorer:
question by a highly respected
industrial senior researcher:
„But you can‘t implement decisions!“
(symptom of the configware/software chasm)
section of a major pipe network on rDPA
hypothetical branching example
to illustrate time-to-space migration
S = R + (if C then A else B endif);
[Figure: rDPA pipe network section with inputs R, B, A; C = 1 selects A; adder (+) with output S]
clock: 200 MHz (5 nanosec)
simple conservative CPU example (C = 1):
step | memory cycles | nanoseconds
if C then read A:
read instruction | 1 | 100
instruction decoding
read operand* | 1 | 100
operate & reg. transfers
if not C then read B:
read instruction | 1 | 100
instruction decoding
add & store:
read instruction | 1 | 100
instruction decoding
operate & reg. transfers
store result | 1 | 100
total | 5 | 500
*) if no intermed. storage in register file
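The time-to-space migration of S = R + (if C then A else B endif) can be sketched as follows. The figures of 5 memory cycles, 100 ns per memory cycle and a 5 ns clock come from the slide; the assumption that the rDPA data path delivers one result per clock, and all function and variable names, are hypothetical and only serve this illustration.

```python
from typing import List

MEMORY_CYCLE_NS = 100   # per memory cycle, from the CPU table above
CPU_MEMORY_CYCLES = 5   # total memory cycles, from the CPU table above
RDPA_CLOCK_NS = 5       # 200 MHz clock


def datapath(r: int, a: int, b: int, c: bool) -> int:
    """Spatial version: a multiplexer picks A or B, an adder adds R.
    The decision is wired into the data path; no instruction is fetched."""
    selected = a if c else b
    return r + selected


def run_stream(rs: List[int], as_: List[int], bs: List[int], cs: List[bool]) -> List[int]:
    """Feed operand streams through the data path, one result per item."""
    return [datapath(r, a, b, c) for r, a, b, c in zip(rs, as_, bs, cs)]


results = run_stream([1, 2], [10, 20], [100, 200], [True, False])

# Assumption: the pipelined data path yields one result per clock cycle.
cpu_ns_per_result = CPU_MEMORY_CYCLES * MEMORY_CYCLE_NS
speedup = cpu_ns_per_result / RDPA_CLOCK_NS
print(results, cpu_ns_per_result, speedup)
```

Under these assumptions the CPU needs 500 ns per result against 5 ns on the data path, which is where a factor of about 100 would come from in this example.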
rDPA (coarse grain) vs. FPGA (fine grain)
Status: ~1998
roughly: performance (MOPS/mW, orders of magnitude)
µProc 0
DSP 1
FPGA 2
rDPA 3
hardwired 3
roughly: area efficiency (trans/chip, orders of magnitude)
µProc 0
commodity FPGA 2
rDPA 4
hardwired 4
Why the speed-up ...
... although the FPGA clock is slower by x3 or even more
(most know-how from „high level synthesis“ discipline)
support operations: no clock nor memory cycle
decisions without memory cycles nor clock cycles
moving operator to the data stream (before run time)
most „data fetch“ without memory cycle
>> Final Remarks <<
First Indications of Change
PARS & Speed-up, Basel, Switzerland, March 2003: keynote address*
10th RAW at IPDPS, Nice, France, April 2003: after
a decade of non-overlap: first IPDPS people coming
PDP’04, La Coruna, Spain, Febr. 2004: keynote address*
IPDPS, Santa Fe, NM, USA, April 2004: keynote address*
HPC Asia 2004 - 7th Int‘l Conference on High Performance Computing,
July 20-22, 2004 Omiya Sonic City, Tokyo Area, Japan:
Workshop on Reconfigurable Systems f. HPC (RHPC) + keynote address*
SBAC-PAD 2004 - 16th Symposium on Computer Architecture and High
Performance Computing, Foz do Iguacu, PR, Brazil, October 27-29,
2004: topic area explicitly: Reconfigurable Systems
HPCA-11, 11th International Symposium on High-Performance Computer
Architecture, San Francisco, Febr. 12-16, 2005: topic area explicitly:
Embedded and reconfigurable architectures
*) keynote speaker: http://hartenstein.de
HPC experts coming ...
example: N-body problem went configware
paper already
at FPL 1999
http://fpl.org
Simulation of Star Clusters: x10 speed-up
by supercomputer-to-morphware migration
(also molecular biology et al.)
Configware by
Reinhard Männer, University of Mannheim
Gottfried Kirch
HPC pioneer since 1976 (Physics Dept Heidelberg)
Astrophysics by
Rainer Spurzem, University of Heidelberg
ARI, Astronomisches Rechen-Institut, founded 1700
in Berlin, moved 1945 to Heidelberg by August Kopff
Conclusions
RC has become mainstream in all kinds of applications
CS education deficits: a curricular revision is overdue
... by a merger with the embedded systems mind set
We need an academic grass roots movement, for ....
...free material & tools for undergraduate lab courses
to program and emulate small SW/CW/HW examples
all know-how needed readily available: