Coarse Grain Reconfigurable Architectures

Reconfigurable HPC
May 14, 2004, TU Tallinn, Estonia
Reiner Hartenstein
TU Kaiserslautern
part 1
Introduction
Preface
The White House, Sept 2000:
Bill Clinton condemns the Digital Divide in America:
access to the internet
World Economic Forum 2002:
The Global Digital Divide,
disparity between the "haves" and "have nots"
The Digital Divide of Parallel Computing:
Access to Configware (CW) Solutions
© 2004, [email protected]
http://hartenstein.de
The „havenots“
Configware methodology to move data around more efficiently:
the „havenots“ are found in the HPC community
Configware engineering as a qualification for programming embedded systems*:
the „havenots“ are our typical CS graduates
Reconfigurable HPC is torpedoed by deficits in education:
curricular revisions are overdue
*) also HPC!
Software to Configware Migration
Software to Configware Migration
is the most important source of speed-up
Hardware is just frozen Configware
This talk illustrates the performance benefit which may be obtained from Reconfigurable Computing (RC).
Stressing the coarse grain point of view, this talk hardly mentions FPGAs
(but coarse grain may always be mapped onto FPGAs)
>> HPC <<
• HPC
• Embedded Computing
• The wrong Roadmap
• Configware Engineering
• Dual Machine Paradigms
• Speed-up Examples
• Final Remarks
http://www.uni-kl.de
moving data around inside the Earth Simulator
Crossbar weight: 220 t, 3000 km of cable
ES 20: TFLOPS
5120 processors, 5000 pins each
data are moved around by software
i.e. by memory-cycle-hungry instruction streams which fully hit the memory wall
(slower than CPU clock by 2 orders of magnitude)
extremely unbalanced
(figure stolen from Bob Colwell)
path of least resistance*: avoiding a paradigm shift
Many researchers seem never to stop working on sophisticated solutions for marginal improvements ...
... continuously ignoring methodologies promising speed-ups by orders of magnitude ....
... continue to bang their heads against the memory wall
blinders to ignore the impact of morphware
*) [Michel Dubois]
instead of
… understand only this parallelism solution:
the instruction-stream-based approach: von Neumann bottlenecks
the data-stream-based approach: has no von Neumann bottleneck
>> Embedded Computing <<
The History of Paradigm Shifts
“Mainstream Silicon Application is switching every 10 Years”
[Figure: Makimoto's wave along a time axis (1957, 1967, 1977, 1987, 1997, 2007), alternating between "standard" and "custom": TTL; custom LSI, MSI; µproc., memory; ASICs, accel's; ?; ?. Marked on the wave: 1st Design Crisis, 2nd Design Crisis]
“The Programmable System-on-a-Chip is the next wave”
Makimoto’s 3rd Wave
• Fine Grain Subsystems (FPGAs):
– 1st half of 3rd wave
– universal (but less efficient)
• Coarse Grain Subsystems:
– 2nd half of 3rd wave
– domain-specific
– much more flexible than 2nd half of 2nd wave
How's next Wave ?
[Figure: Makimoto's wave (1957 to 2007) alternating between "standard" and "custom", overlaid with Tredennick's Paradigm Shifts and Hartenstein's Curve]
hardwired: algorithm: fixed, resources: fixed
procedural programming: algorithm: variable, resources: fixed
structural programming (FPGAs, coarse grain RAs): algorithm: variable, resources: variable
4th wave ? no further wave !
History of Silicon Application
3 different mind sets: hardware people, CS people, new breed needed
[Figure: time axis 1957 to 2007 with TTL; LSI, MSI; µproc., memory; ASICs, accel's; FPGAs; soft CPUs; coarse grain]
Common terminology needed
History of Machine Models
mainframe age: compile; mainframe
computer age (PC age): compile; µProc. + accel.
procedural mind set: instruction-stream-based
(coordinates by Makimoto's wave: 1957 to 2007)
scientific computing example (molecular dynamics, astrophysics, plasma physics, hydrodynamics):
MD-GRAPE-2 PCI board [1997]
4 chips for N-body simulation
converts a PC to 64 GFlops
users: RIKEN institute, ARI, Heidelberg, etc.
History of Machine Models
mainframe age: compile; mainframe
computer age (PC age): compile / design by hardware guys; µProc. + accel.
procedural mind set: instruction-stream-based
structural mind set: data-stream-based
(coordinates by Makimoto's wave: 1957 to 2007)
the hardware / Software Chasm:
µprocessor / accelerators
It's the gap between the procedural (instruction-stream-based) and the structural (data-stream-based) mind set
typical programmers don't understand function evaluation
without machine mechanisms (counters, state registers)
Growth Rate of Embedded Software
already today, more than 98% of all microprocessors are used within embedded systems
>10 times more programmers will write embedded applications than computer software by 2010
[Figure: growth factor (0 to 2) plotted over months (10, 12, 18)]
*) Department of Trade and Industry, London
typical CS graduates: the „havenots“
To-day, „typical“ CS graduates are
unqualified for this labor market
… cannot cope with Hardware / Configware /
Software partitioning issues
… cannot implement Configware
the current CS mind set is based on the Submarine Model
This model does not support Hardware / Configware / Software partitioning
[Figure: the Submarine Model, top to bottom: Algorithm; procedural high level Programming Language; Assembly Language; Hardware]
Hardware invisible: under the surface
Hardware / Configware / Software Partitioning skills urgently needed
Software to Configware Migration is the most important source of speed-up
Hardware is just frozen Configware
to cope with each of it: SW, CW, HW
or: to cope with any combination of co-design: SW/HW, SW/CW/HW
[Figure: partitioning of an Algorithm into SW, CW and HW]
By the way ...
... the oldest and largest conference in the field:
International Conference on Field-Programmable Logic and Applications (FPL)
http://fpl.org
Aug. 30 – Sept. 1, 2004, Antwerp, Belgium
288 submissions !
... going into every type of application (µProc. accel.)
they all work on high performance
Dominance of the Submarine Model ...
[Figure: the Submarine Model: (procedural) above the surface, Hardware below; label: structurally disabled]
... indicates that our CS education system produces zillions of mentally disabled CS graduates
… disabled to cope with solutions other than instruction-stream-based
CS Education
You cannot* teach Hardware to a Programmer
But to a Hardware Guy you always can teach Programming
*) efficiently
[Figure labels: have: structural, natural; have not: procedural]
>> the wrong Roadmap <<
Completely wrong roadmap
beef up old architectural principles by new technology?
... the CPU is a Methuselah, the steam engine of the silicon age
[Figure: „Pollack's Law“ (simplified) [intel]: growth factor of performance and of area efficiency, plotted over feature size in µm (down to 0.1)]
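Pollack's rule is commonly quoted as: the performance of a single core grows roughly with the square root of the area (complexity) spent on it. The following minimal sketch assumes exactly this simplified form; the function names are illustrative and do not come from the talk. It shows why area efficiency drops as a CPU is beefed up.

```python
import math


def pollack_performance(area_factor: float) -> float:
    """Relative performance of a core that uses `area_factor` times the baseline area.

    Assumption: the simplified rule perf ~ sqrt(area).
    """
    if area_factor <= 0:
        raise ValueError("area_factor must be positive")
    return math.sqrt(area_factor)


def area_efficiency(area_factor: float) -> float:
    """Performance per unit area, relative to the baseline core."""
    return pollack_performance(area_factor) / area_factor


# Each row: area factor, relative performance, relative area efficiency.
for factor in (1, 2, 4, 16):
    print(factor, round(pollack_performance(factor), 2), round(area_efficiency(factor), 2))
```

Under this assumption a core with 4 times the area is only twice as fast, so its area efficiency is halved.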
Completely wrong mind set
The key problem, the memory wall,
cannot be solved by new CPU technology
The vN paradigm is not a communication paradigm
Its monopoly creates a completely wrong mind set
We need a 2nd machine paradigm (a 2nd mind set ...)
We need an architectural communication paradigm
But we need both paradigms: a dichotomy
3rd machine model became mainstream
mainframe age: compile; mainframe (instruction-stream-based)
computer age (PC age): compile / design; µProc. + accel.
morphware age: µProc. + rDPA
(Makimoto's wave: 1957 to 2007)
>> Configware Engineering <<
de facto Duality of RAM-based platforms
We now have 2 types of programmable platforms
traditional:
  RAM-based platform: CPU
  „running“ on it: software
  machine paradigm: von Neumann etc.: instruction-stream-based
new:
  RAM-based platform: morphware (FPGA, rDPA ..)
  „running“ on it: configware
  machine paradigm: 2nd paradigm, anti machine: data-stream-based
morphware [DARPA]: soft hardware
hardware viewed as frozen configware: just earlier binding
.... the brain hurts
The HPC scene believed to be smart, when smiling about us CW guys [Gordon Bell]
Others experienced that the brain hurts when trying the paradigm shift
morphware: fastest growing sector of the IC market
CW has become mainstream ...
... going into every type of application [Gordon Bell]
From Software to Configware Industry
Software Industry's Secret of Success: procedural personalization via 1) RAM-based 2) Machine Paradigm
Growing Configware Industry: Repeat Success Story by a 2nd Machine Paradigm !
structural personalization: RAM-based anti machine
[Figure: computer age (PC age): compile, µProc.; morphware age: rDPA; time axis 1957 to 2007]
benefit from RAM-based & 2nd paradigm
1) RAM-based platform needed for:
• flexibility, programmability
• avoiding the need of specific silicon (high price): mask cost currently 2 mio $, rapidly growing
2) simple 2nd machine paradigm needed as a common model:
• to avoid the need of circuit expertise
• needed to educate zillions of programmers
By the way: relocatability is more difficult, but not hopeless
(vN relocatability is based on the von Neumann bottleneck)
Nick Tredennick's Paradigm Shifts explain the differences
Software Engineering (CPU):
  resources: fixed
  algorithm: variable
  1 programming source needed: software
Configware Engineering:
  resources: variable
  algorithm: variable
  2 programming sources needed: configware, flowware
Compilation: Software vs. Configware
Software Engineering: source program; software compiler; software code
Configware Engineering: source „program“; configware compiler (mapper: placement & routing; data scheduler); configware code and flowware code
Flowware programs data streams
Flowware defines:
... which data item
at which time
at which port
[Figure: input data streams and output data streams of a DPA; each stream is drawn as data items (x) over the axes time and port #]
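The three things flowware defines (which data item, at which time, at which port) can be written down as a plain schedule. The sketch below is only an illustration under stated assumptions: `StreamEvent`, `skewed_schedule` and `render` are hypothetical names, and the skew of one time step per port is an assumed systolic-style pattern, not a format taken from the talk.

```python
from typing import NamedTuple


class StreamEvent(NamedTuple):
    time: int   # time step at which the item crosses the array boundary
    port: int   # port number of the DPA
    item: str   # name of the data item


def skewed_schedule(rows: list[list[str]]) -> list[StreamEvent]:
    """Stream row r through port r, delayed by r time steps (assumed skew)."""
    events = []
    for port, row in enumerate(rows):
        for i, item in enumerate(row):
            events.append(StreamEvent(time=port + i, port=port, item=item))
    return sorted(events)  # ordered by time, then port


def render(events: list[StreamEvent], n_ports: int) -> list[str]:
    """One text line per port: 'x' where an item arrives, '-' where the port idles."""
    n_steps = max(e.time for e in events) + 1
    grid = [["-"] * n_steps for _ in range(n_ports)]
    for e in events:
        grid[e.port][e.time] = "x"
    return ["".join(line) for line in grid]


rows = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]
schedule = skewed_schedule(rows)
for line in render(schedule, n_ports=3):
    print(line)
```

The printed grid has the same shape as the stream diagrams on the slide: one line per port, time running to the right.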
Flowware: not new
Flowware: data stream* ... around 1980
*) no confusion, please: no „dataflow machine“ !!!
[Figure: machine models along Makimoto's wave (1957 to 2007): mainframe age (compile; mainframe), computer age (PC age) (compile / design; µProc. + accel.), morphware age (µProc. + rDPA)]
data streams*: not new
1980: data streams (Kung, Leiserson: systolic arrays)
1989: data-stream-based Xputer architecture
1990: rDPU (Rabaey)
1994: Flowware Language MoPL (Becker et al.)
1995: super systolic array (rDPA) + DPSS tool (Kress)
1996+: Stream-C language, SCCC (Los Alamos),
SCORE, ASPRC, Bee (UC Berkeley), ...
1996+: streaming languages (Stanford et al.)
1996: configware / software partitioning compiler (Becker)
>> Dual Machine Paradigms <<
Why a new machine paradigm ???
[Figure labels: µprocessor, rDPA]
The anti machine as the 2nd paradigm is the key to curricular innovation
... a Trojan horse to introduce the structural domain to the procedural-only mind set of programmers
Programming by flowware instead of software is very easy to learn (... same language primitives)
Flowware education: no fully fledged hardware expert needed to program embedded systems
von Neumann vs. anti machine
[Figure, left: instruction stream machine (von Neumann etc.): CPU with DPU and program counter, RAM memory, von Neumann bottleneck]
[Figure, right: data stream machine (anti machine): (r)DPA without sequencer, surrounded by asM memory banks, each with a data counter]
asM: auto-sequencing Memory
asMA: auto-sequencing Memory Array
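A toy comparison may help to see the difference between the two machine models. The sketch below is an assumption-laden illustration, not a model from the talk: the six-instruction loop body, the class `AutoSequencingMemory` and both function names are hypothetical. It adds two vectors once with a program counter walking an instruction stream and once with data counters located in the memory banks.

```python
def von_neumann_sum(a, b):
    """Instruction-stream model: every step of the loop body is fetched from memory."""
    # Assumed loop body of a very simple CPU (hypothetical instruction names).
    program = ["load a[i]", "load b[i]", "add", "store y[i]", "inc i", "branch"]
    y, instr_fetches, data_accesses = [], 0, 0
    for i in range(len(a)):            # the program counter re-walks the loop body
        instr_fetches += len(program)
        y.append(a[i] + b[i])
        data_accesses += 3             # two loads and one store
    return y, instr_fetches, data_accesses


class AutoSequencingMemory:
    """Memory bank that carries its own data counter (asM)."""

    def __init__(self, data=()):
        self.data = list(data)
        self.counter = 0               # the data counter sits with the memory bank
        self.accesses = 0

    def read_next(self):
        value = self.data[self.counter]
        self.counter += 1
        self.accesses += 1
        return value

    def write_next(self, value):
        self.data.append(value)
        self.counter += 1
        self.accesses += 1


def anti_machine_sum(a, b):
    """Data-stream model: the operator is configured once, the banks stream the data."""
    bank_a, bank_b, bank_y = AutoSequencingMemory(a), AutoSequencingMemory(b), AutoSequencingMemory()
    adder = lambda x, y: x + y         # stands in for a DPU configured before run time
    for _ in range(len(a)):
        bank_y.write_next(adder(bank_a.read_next(), bank_b.read_next()))
    data_accesses = bank_a.accesses + bank_b.accesses + bank_y.accesses
    return bank_y.data, 0, data_accesses   # no instruction fetch at run time


print(von_neumann_sum([1, 2, 3], [10, 20, 30]))
print(anti_machine_sum([1, 2, 3], [10, 20, 30]))
```

Both variants touch the data equally often; in this model only the instruction-stream variant pays additional memory cycles for fetching instructions.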
Behavior of the Counter
[Figure: the program counter sits inside the CPU (next to the DPU); the data counters sit inside the asM memory banks around the (r)DPA]
Counters: the same micro architecture ?
instruction stream machine (von Neumann etc.): CPU with DPU and program counter
data stream machine (anti machine): memory bank (asM) with data counter
AGU: address generator unit
yes, is possible, but for data counters ...
... a much better AGU methodology is available*
*) for history of AGUs see Herz et al.: Proc. ICECS 2002, Dubrovnik, Croatia
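What a data counter does can be sketched as a generator that emits an address sequence from a few parameters, so that no instruction has to compute addresses at run time. This is a generic illustration only: `agu_2d_scan` and its parameters are hypothetical and do not reproduce the AGU methodology referenced above.

```python
from typing import Iterator


def agu_2d_scan(base: int, row_stride: int, width: int, height: int,
                x_step: int = 1, y_step: int = 1) -> Iterator[int]:
    """Yield the addresses of a rectangular scan window, row by row.

    base: address of the window's first element
    row_stride: address distance between two consecutive rows in memory
    width, height: size of the scan window
    x_step, y_step: step widths of the scan
    """
    for y in range(0, height, y_step):
        for x in range(0, width, x_step):
            yield base + y * row_stride + x


# A 3-by-2 window inside an array whose rows are 8 words long.
print(list(agu_2d_scan(base=100, row_stride=8, width=3, height=2)))
```

In hardware the same sequence would come from counters and adders configured once, which is why address computation costs no memory cycles.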
The dichotomy of models
• Note for von Neumann:
state register is with the CPU
• Note for the anti machine:
state register is with memory bank /
state registers are within memory banks
(r)DPA
commercial rDPA example: PACT XPP - XPU128
XPP128 rDPA: working silicon
• Full 32 or 24 Bit Design
• 2 Configuration Hierarchies
• Evaluation Board available, and
• XDS Development Tool with Simulator
[Figure: XPP128 array of rDPUs (ALU, PAE core, Ctrl, CFG); buses not shown. © PACT AG, Munich http://pactcorp.com]
XPP64A: Platform Development Board
- SDR Board In Debug Phase -> XPP64A Chips from STMicro Fab
- Assembly & Test / Available March 2003
symbiosis of machine models
mainframe age: compile; mainframe
computer age (PC age): compile / design; µProc. + accel.
morphware age: co-compiler; µProc. + rDPA (symbiosis)
replace PC by PS
(Makimoto's wave: 1957 to 2007)
>> Speed-up Examples <<
Better solutions by Configware
instead of software
methodologies not new: high level synthesis (1980+)
loop transformations (1970+)
many other areas
Memory cycles minimized
e.g.: no instruction fetch at run time & other effects
No cache misses!
Memory access for data: caches do not help anyhow
Loop xforms: no intra-stream data memory cycles
Complex address computation: no memory cycles
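The effect of removing intra-stream memory cycles can be illustrated with a FIR filter. The following sketch is an assumption for illustration, not code from the talk: `fir_naive` re-reads the whole window from memory for every output, while `fir_streamed` reads every sample once and keeps the window in a shift register next to the operators. Only data reads are counted; function names are hypothetical.

```python
from collections import deque


def fir_naive(samples, coeffs):
    """Re-read every sample of the window from memory for each output value."""
    taps, reads, out = len(coeffs), 0, []
    for n in range(taps - 1, len(samples)):
        acc = 0
        for k in range(taps):
            acc += coeffs[k] * samples[n - k]
            reads += 1                      # one data memory read per tap
        out.append(acc)
    return out, reads


def fir_streamed(samples, coeffs):
    """Read each sample once; a shift register holds the current window."""
    taps, reads, out = len(coeffs), 0, []
    window = deque(maxlen=taps)             # oldest sample drops out automatically
    for s in samples:
        window.appendleft(s)                # newest sample first
        reads += 1                          # the only data memory read per sample
        if len(window) == taps:
            out.append(sum(c * w for c, w in zip(coeffs, window)))
    return out, reads


samples, coeffs = [1, 2, 3, 4, 5], [1, 2, 3]
print(fir_naive(samples, coeffs))
print(fir_streamed(samples, coeffs))
```

Both functions return the same output values; the streamed version needs one read per sample instead of one read per tap and output.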
speed-up examples
key issue: algorithmic cleverness
platform | application example | speed-up factor | method
PACT Xtreme 4-by-4 array [2003] | 16 tap FIR filter | x16 MOPS/mW | straight forward
MoM anti machine with DPLA* [1983] | grid-based DRC**: 1-metal 1-poly nMOS***, 256 reference patterns | > x1000 (computation time) | multiple aspects
CPU 2 FPGA [FPGA 2004] | migrate several simple application examples | x7 – x46 (compute time) | hi level synthesis
DSP 2 FPGA [Xilinx 2004 2)] | from fastest DSP: 10 gMACs to 1 teraMAC | x100 (compute time) | not spec.
*) DPLA: MPC fabr. via E.I.S. multi univ. project
**) Design Rule Check
***) for 10-metal 3-poly cMOS expected: >> x10,000
2) Wim Roelandts
Software to Configware Migration:
(RAW’99 at Orlando)
Ulrich Nageldinger‘s talk
about KressArray Xplorer:
question by a highly respected
industrial senior researcher:
„But you can‘t implement decisions!“
(symptom of ...)
branching example: time-to-space migration
S = R + (if C then A else B endif);
on rDPU:
[Figure: datapath with inputs R, B, A, select input C = 1, adder +, output S, clock]
on a very simple CPU (C = 1):
step | memory cycles
if C then read A: read instruction | 1
  instruction decoding |
  read operand* | 1
  operate & register transfers |
if not C then read B: read instruction | 1
  instruction decoding |
add & store: read instruction | 1
  instruction decoding |
  operate & register transfers |
  store result | 1
total | 5
*) if no intermed. storage in register file
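The branching example can be replayed in a few lines. The sketch below is one possible reading of the slide, labeled as such: the assignment of the five memory cycles to individual steps is an assumption, the function names are hypothetical, and the rDPU is modeled as a multiplexer feeding an adder with R, A and B arriving as data streams at its ports.

```python
# Assumed mapping of the slide's five memory cycles to steps of the simple CPU.
CPU_MEMORY_CYCLES = {
    "read instruction (if C then read A)": 1,
    "read operand": 1,
    "read instruction (if not C then read B)": 1,
    "read instruction (add & store)": 1,
    "store result": 1,
}


def s_on_cpu(r, a, b, c):
    """Evaluate S sequentially in time; return (S, memory cycles spent)."""
    operand = a if c else b
    return r + operand, sum(CPU_MEMORY_CYCLES.values())


def s_on_rdpu(r, a, b, c):
    """Evaluate S in space: both branches exist as wires, C only steers a multiplexer."""
    mux = (b, a)[int(bool(c))]      # multiplexer output
    return r + mux, 0               # no instruction fetch, no memory cycle for the decision


print(s_on_cpu(10, 1, 2, True))
print(s_on_rdpu(10, 1, 2, True))
print(s_on_rdpu(10, 1, 2, False))
```

This is what time-to-space migration means here: the decision is no longer a sequence of fetched instructions but a piece of configured datapath.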
Why the speed-up ...
... although the FPGA clock is slower by a factor of 3 or even more
(most know-how from „high level synthesis“ discipline)
support operations: no clock nor memory cycle
decisions without memory cycles nor clock cycles
moving operator to the data stream (before run time)
most „data fetch“ without memory cycle
>> Final Remarks <<
First Indications of Change
PARS & Speed-up, Basel, Switzerland, March 2003: keynote address*
10th RAW at IPDPS, Nice, France, April 2003: after
a decade of non-overlap: first IPDPS people coming
PDP’04, La Coruna, Spain, Febr. 2004: keynote address*
IPDPS, Santa Fe, NM, USA, April 2004: keynote address*
HPC Asia 2004 - 7th Int‘l Conference on High Performance Computing,
July 20-22, 2004 Omiya Sonic City, Tokyo Area, Japan:
Workshop on Reconfigurable Systems f. HPC (RHPC) + keynote address*
SBAC-PAD 2004 - 16th Symposium on Computer Architecture and High
Performance Computing, Foz do Iguacu, PR, Brazil, October 27-29,
2004: topic area explicitly: Reconfigurable Systems
HPCA-11, 11th International Symposium on High-Performance Computer
Architecture, San Francisco, Febr. 12-16, 2005: topic area explicitly:
Embedded and reconfigurable architectures
*) keynote speaker: http://hartenstein.de
HPC experts coming ...
example: N-body problem went configware
paper already
at FPL 1999
http://fpl.org
Simulation of Star Clusters: x10 speed-up
by supercomputer-to-morphware migration
(also molecular biology et al.)
Configware by
Reinhard Männer, University of Mannheim
Gottfried Kirch
HPC pioneer since 1976 (Physics Dept Heidelberg)
Astrophysics by
Rainer Spurzem, University of Heidelberg
ARI, Astronomisches Rechen-Institut, founded 1700
in Berlin, moved 1945 to Heidelberg by August Kopff
August Kopff
18th Director, Astronomisches Rechen-Institut (ARI) 1924 - 1954
discovered the Kopff comet, Koenigstuhl Observatory, Heidelberg, Germany, 1906
discovered the asteroid 631 Philippina, 21 March 1907,
which became the first asteroid ever visited by a spacecraft
- on the Galileo mission to Jupiter
The Galileo spacecraft's 14-year odyssey
came to an end on Sunday, Sept. 21, 2003
Copyright © 1996 by Masayuki Suzuki
Conclusions
RC has become mainstream in all kinds of applications
CS education deficits: a curricular revision is overdue
... by a merger with the embedded systems mind set
We need an academic grass roots movement, for ....
...free material & tools for undergraduate lab courses
to program and emulate small SW/CW/HW examples
all know-how needed readily available:
END