Coarse Grain Reconfigurable Architectures


Rhodes Island, Greece, April 25-26, 2006
Reiner Hartenstein
TU Kaiserslautern
(keynote)
(from HPC to)
New Horizons of Very High
Performance Computing (VHPC):
Hurdles and Chances
Reconfigurable Supercomputing
(VHPC) going commercial
Cray XD1
Silicon Graphics RASC
… and other vendors
© 2006, [email protected]
http://hartenstein.de
The Pervasiveness of RC
“FPGA and ….”: number of hits by Google
[Figure: Google hit counts for “FPGA and …” queries, ECE-savvy scene versus Math/SW-savvy scene. Hit counts shown: 647,000; 1,490,000; 171,000; 194,000; 398,000; 1,620,000; 127,000; 113,000; 158,000; 162,000; 915,000; 272,000.]
unqualified for RC ?
Methodology ?
a world-wide mass movement
reminds me of the mass migration of lemmings
not really a sense of direction
terminology chaos
an urgent need to get organized
>> Outline <<
• Reconfigurable Computing Paradox
• The Supercomputing Paradox
• We are using the wrong model
• Coarse-grained Reconfigurable Devices
• Super Pentium for Desktop Supercomputer
http://www.uni-kl.de
The Reconfigurable Computing Paradox
poor FPGA technology:
very poor effective integration density
„very power-hungry“ [Rick Kornfeld*]
lower clock frequencies, and more expensive.
poor tools:
very poor application development support
Languages and tools unacceptable for software people
most hardware experts (86%**) hate their tools
poor education:
RC education: extremely poor, or none
… teaching as if for a 50-year-old mainframe … RC is ignored by CS curricula
*) personal communication
**) DeHon ‘98
Education ?
Joint Task Force for Computing Curricula 2004 fully ignores Reconfigurable Computing
FPGA & synonyms: 0 hits (Google: 10 million hits)
not even here
Completed ?
Computing Curricula v.2005:
no changes other than „… FPGA, etc.“
(not really mentioning that it‘s missing)
Taskforce activity completed ?
Next task force in 2020 or later ?
Tools ?
End of this week:
brainstorming session at DARPA:
(urgently needed – overdue! )
Technology: fine-grained RC: 1st DeHon‘s Law
[1996: Ph. D, MIT]
[Figure: density in transistors per microchip (log scale from 10^0 to 10^9) versus year (1980 to 2010). Curves: FPGA physical (wiring overhead), FPGA logical (reconfigurability overhead) and FPGA routed (routing congestion). Annotation: immense area inefficiency, overhead >> 10 000.]
pre-FPGA era published speed-up factors
[Figure: published speed-up factors (relative performance, log scale from 10^0 to 10^9) versus year (1980 to 2010), with the 8080 and the Pentium 4 marked as reference points. Applications shown: DSP and wireless, image processing, pattern matching, real-time face detection, Reed-Solomon decoding, Viterbi decoding, FFT, MAC, crypto, multimedia, video-rate stereo vision, pattern recognition, SPIHT wavelet-based image compression, bioinformatics, Smith-Waterman pattern matching, BLAST, protein identification, molecular dynamics simulation, Los Alamos traffic simulation, astrophysics (GRAPE), Lee routing (by TU-KL), 2-D FIR filter [TU-KL], and grid-based DRC („fair comparison“; no FPGA: DPLA on MoM by TU-KL). Labelled speed-up factors range from 20 to 15,000.]
http://xputers.informatik.uni-kl.de/faq-pages/fqa.html
pre FPGA era: Why DPLA* was so good
Large arrays of canonical Boolean expressions
PLA layout is similar to RAM / ROM layout:
close to Moore because of small overhead (wiring, programmability, routing)
Mid-1980s: first very tiny FPGAs available
GAG: Generic Address Generator, to avoid address computation overhead [M. Herz et al.: ICECS 2003, Dubrovnik]
ASM: Auto-Sequencing Memory
*) designed by TU-KL, fabricated by the E.I.S. German multi-university project
ASM: Auto-Sequencing Memory (anti-von-Neumann machine paradigm)
Data Counter instead of Program Counter
Generalization of the DMA
[Figure: ASM block containing RAM and GAG with data counter]
GAG & enabling technology: published 1989 [by TU-KL], patented by TI 1995
Survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik]
Storage scheme optimization methodology, etc.
*) IMEC & TU-KL
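The data-counter idea can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the GAG design from the cited papers: the function names, the rectangular block scan and the parameters `base`, `row_stride`, `rows` and `cols` are hypothetical. The point it shows is that the address sequence is produced by a generator next to the memory, so no instruction stream is spent on address computation.

```python
from typing import Iterator, List


def generic_address_generator(base: int, row_stride: int,
                              rows: int, cols: int) -> Iterator[int]:
    """Hypothetical GAG: yield the addresses of a rows x cols block scan."""
    for r in range(rows):          # advance to the next scan line
        for c in range(cols):      # advance along the current scan line
            yield base + r * row_stride + c


def auto_sequencing_memory(ram: List[int],
                           addresses: Iterator[int]) -> Iterator[int]:
    """Hypothetical ASM: stream data items in the order the GAG dictates."""
    for addr in addresses:
        yield ram[addr]


# 16 words of RAM, viewed as a 4 x 4 array stored row by row.
ram = list(range(100, 116))
# Scan a 2 x 2 window whose upper-left corner is at address 5.
window = generic_address_generator(base=5, row_stride=4, rows=2, cols=2)
stream = list(auto_sequencing_memory(ram, window))
print(stream)  # [105, 106, 109, 110]
```

The consumer of `stream` never sees an address: it only receives data items in the right order, which is the sense in which the data counter replaces the program counter.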
Thousands or Millions of $ for free
Application migration [from supercomputer] results not only in massive speed-ups:
electricity bills are reduced by an order of magnitude and even more, which you may get for free
…. up to millions of dollars per year
(also a matter of national energy policy)
[Figure labels: Google, Amsterdam, NY]
Reconfigurable Scientific Computing
How do software people program the FPGAs ?
By hiring a good student from the EE Dept. ?
Because of missing RC education:
Far away from optimum solutions ?
Much higher speed-up achievable ?
1 or 2 more orders of magnitude ?
100,000 ?
1,000,000 ?
By education: better speed-up factors ?
[Figure: chart of published speed-up factors (relative performance versus year, 1980 to 2010), repeated from the slide "pre-FPGA era published speed-up factors".]
http://xputers.informatik.uni-kl.de/faq-pages/fqa.html
The Supercomputing Paradox
COTS processor decreasing cost
Increasing number of processors running in parallel
Growing listed Teraflops
Almost stalled application implementation progress
Often limited sustained Teraflops
The Law of More
Very high total cost of the Tera(?)flops
Scientists waiting for affordable compute capacity
Why traditional supercomputing / HPC failed
because of the wrong multi-core interconnect architecture
because of the wrong way in which the data are moved around
instruction-stream-based: memory-cycle-hungry
(stolen from Bob Colwell)
moving data around inside the Earth Simulator
Crossbar weight: 220 t, 3000 km of thick cable
ES 20: TFLOPS
5120 processors, 5000 pins each
Bringing together data and processor
Moving data to the processor by software: moving the grand piano
coarse-grained RC: Hartenstein‘s Law
[1996: ISIS, Austin, TX]
[Figure: transistors per microchip (log scale from 10^0 to 10^9) versus year (1980 to 2010). Curves: rDPA, with area efficiency very close to Moore‘s law, and FPGA routed; annotation: >> 10 000.]
higher speed-up factors by coarse-grained?
[Figure: chart of published speed-up factors (relative performance versus year, 1980 to 2010), repeated from the slide "pre-FPGA era published speed-up factors".]
http://xputers.informatik.uni-kl.de/faq-pages/fqa.html
Coarse grain is about computing, not logic
SNN filter on KressArray (mainly a pipe network)
array size: 10 x 16 = 160 rDPUs
no CPU
rDPU: reconfigurable Data Path Unit, e. g. 32 bits wide
[Figure: SNN filter mapped onto the KressArray. Legend: rDPU not used; backbus connect; used for routing only (rout thru only); operator and routing; port location marker.] [Ulrich Nageldinger]
SW to coarse-grained CW migration example
[Figure: array of rDPUs, with one rDPU configured as an adder (+) and output S]
Compare it to software solution on CPU
S = R + (if C then A else B endif);
[Figure: datapath with inputs R, B and A, selection by C (C = 1), an adder (+) and output S, clock 200, next to the instruction sequence of a very simple CPU (read instruction, instruction decoding, read operand*, operate & register transfers, store result) with columns for memory cycles and nanoseconds.]
hypothetical branching example to illustrate
software-to-configware migration
S = R + (if C then A else B endif);
[Figure: datapath with inputs R, B and A, selection by C (C = 1), an adder (+) and output S; clock: 200 MHz (5 nanosec)]

simple conservative CPU example (C = 1)

statement              step                       memory cycles   nanoseconds
if C then read A       read instruction           1               100
                       instruction decoding
                       read operand*              1               100
                       operate & reg. transfers
if not C then read B   read instruction           1               100
                       instruction decoding
add & store            read instruction           1               100
                       instruction decoding
                       operate & reg. transfers
                       store result               1               100
total                                             5               500

*) if no intermediate storage in register file
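The arithmetic behind this comparison can be written out as a short sketch. The 100 ns memory cycle and the 5 ns clock are taken from the slide; treating the configured datapath as delivering one result per clock is an assumption made here for illustration, and the dictionary keys are only descriptive labels.

```python
MEMORY_CYCLE_NS = 100   # one memory cycle of the simple CPU (from the slide)
CLOCK_NS = 5            # 200 MHz clock of the configured datapath (from the slide)

# Memory accesses on the C = 1 path of S = R + (if C then A else B endif)
cpu_memory_cycles = {
    "read instruction (if C then read A)": 1,
    "read operand A": 1,
    "read instruction (if not C then read B)": 1,
    "read instruction (add & store)": 1,
    "store result S": 1,
}

cpu_time_ns = sum(cpu_memory_cycles.values()) * MEMORY_CYCLE_NS

# Assumption: the configured pipe delivers one result S per clock cycle.
configware_time_ns = 1 * CLOCK_NS

speedup = cpu_time_ns / configware_time_ns
print(cpu_time_ns, configware_time_ns, speedup)  # 500 5 100.0
```

Under these assumptions the hypothetical example yields a factor of 100, and the whole difference comes from the five memory cycles that the instruction stream needs.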
Why the speed-up? What‘s the difference?
moving the locality of operation into the route of the data stream by P&R (placement and routing)
instead of moving data by instruction streams
The wrong mind set ....
S = R + (if C then A else B endif);
„but you can‘t implement decisions!“
not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm
We need Reconfigurable Computing Education
[Figure: section of a very large pipe network with inputs R, B and A, selection by C (C = 1) and an adder (+). Legend: rDPU not used; backbus connect; used for routing only (rout thru only); operator and routing; port location marker.] [Ulrich Nageldinger]
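How a decision fits into a pipe network can be sketched in a few lines. This is a software model for illustration only, not KressArray code: the names `select` and `pipe` and the tuple layout of the input stream are assumptions. The decision becomes a selector operator sitting in the data path, followed by the adder, and every data item simply flows through both.

```python
from typing import Iterable, Iterator, Tuple


def select(c: bool, a: int, b: int) -> int:
    """Selector operator: forwards A if C holds, else B (a multiplexer)."""
    return a if c else b


def pipe(stream: Iterable[Tuple[int, bool, int, int]]) -> Iterator[int]:
    """Pipe of two operators: selector, then adder. One S per input item."""
    for r, c, a, b in stream:
        yield r + select(c, a, b)   # S = R + (if C then A else B endif)


# Each tuple is one (R, C, A, B) item of the input data streams.
inputs = [(10, True, 1, 2), (20, False, 1, 2), (30, True, 5, 7)]
results = list(pipe(inputs))
print(results)  # [11, 22, 35]
```

There is no branch instruction anywhere in this model: both A and B arrive at the selector, and C only decides which one is passed on.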
The new paradigm: how the data are traveling
how the data are traveling: no, not by instruction execution
not transport-triggered: old hat + instruction-driven (vN Move Processor: instruction-driven)
[Jack Lipovski, EUROMICRO, Nice, 1975]
DPU to DPU: pipeline, or chaining (super systolic array)
P&R: move locality of operation, not data !
Data streams
[Figure: a DPA (pipe network) surrounded by ASMs. Input data streams enter and output data streams leave through numbered ports over time, staggered from port to port (H. T. Kung paradigm, systolic array).]
Data streams define: ... which data item at which time at which port
ASM: Auto-Sequencing Memory (RAM, GAG, data counter), implemented by distributed memory
50 & more on-chip ASMs are feasible
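The phrase "which data item at which time at which port" can be made concrete with a small schedule sketch. It assumes the simplest systolic staggering, in which port p starts p time steps after port 0; the function name and the one-step skew are assumptions for illustration and are not taken from the slides.

```python
from typing import List, Optional


def skewed_schedule(streams: List[List[int]]) -> List[List[Optional[int]]]:
    """Return schedule[t][p]: the item entering port p at time t (None = idle)."""
    ports = len(streams)
    longest = max(len(items) for items in streams)
    steps = longest + ports - 1
    schedule: List[List[Optional[int]]] = [[None] * ports for _ in range(steps)]
    for p, items in enumerate(streams):
        for i, item in enumerate(items):
            schedule[i + p][p] = item   # port p is delayed by p time steps
    return schedule


schedule = skewed_schedule([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
for t, row in enumerate(schedule):
    print(t, ["-" if item is None else item for item in row])
```

Each row of the result is one time step and each column one port, which is the same kind of time/port pattern the figure draws with "x" and "-" marks.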
The Generalization of the Systolic Array
systolic array (Kung): only for applications with regular data dependencies
remedy? [R. Kress]: discard algebraic synthesis methods, use optimization algorithms, e. g.: simulated annealing
Kress-Kung paradigm: super systolic array
reconfigurability makes sense
Achievement: also non-linear and non-uniform pipes, and even wilder pipe structures are possible
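A minimal sketch of placement by simulated annealing follows. It is not the KressArray mapper: the cost function (total Manhattan wire length), the move (swapping two array cells), the cooling schedule and all names and parameter values are assumptions chosen to keep the example short.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

Pos = Tuple[int, int]


def wire_length(placement: Dict[str, Pos], nets: List[Tuple[str, str]]) -> int:
    """Total Manhattan distance over all operator-to-operator connections."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)


def anneal_placement(ops: List[str], nets: List[Tuple[str, str]],
                     rows: int, cols: int, steps: int = 4000,
                     t_start: float = 4.0, t_end: float = 0.05,
                     seed: int = 0) -> Tuple[Dict[str, Pos], int]:
    """Place operators on a rows x cols array, minimizing wire length."""
    if len(ops) > rows * cols:
        raise ValueError("more operators than array cells")
    rng = random.Random(seed)
    slots: List[Optional[str]] = list(ops) + [None] * (rows * cols - len(ops))
    rng.shuffle(slots)

    def to_placement(s: List[Optional[str]]) -> Dict[str, Pos]:
        return {op: divmod(i, cols) for i, op in enumerate(s) if op is not None}

    cost = wire_length(to_placement(slots), nets)
    best, best_cost = list(slots), cost
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
        i, j = rng.randrange(len(slots)), rng.randrange(len(slots))
        slots[i], slots[j] = slots[j], slots[i]               # trial move
        new_cost = wire_length(to_placement(slots), nets)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                                   # accept the move
            if cost < best_cost:
                best, best_cost = list(slots), cost
        else:
            slots[i], slots[j] = slots[j], slots[i]           # reject: undo
    return to_placement(best), best_cost


# A non-uniform pipe: a -> b -> c -> d, placed on a 3 x 3 array.
ops = ["a", "b", "c", "d"]
nets = [("a", "b"), ("b", "c"), ("c", "d")]
placement, cost = anneal_placement(ops, nets, rows=3, cols=3)
print(placement, cost)
```

Because the optimizer only needs a cost function, nothing here depends on the data dependencies being regular, which is why this style of mapping also copes with non-linear and non-uniform pipes.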
Here is the common model
it’s not von Neumann
the vN monopoly in our curricula is severely harmful
the tail is wagging the dog
we need dual paradigm education
software code: instruction-stream-based CPU
configware code: data-stream-based reconfigurable accelerator
hardwired accelerator
A potential Pentium successor
Discard most caches
have 64* cores, 0.5 - 1 GHz
with clever interconnect for:
! concurrent processes,
! multithreading, and
! Kung-Kress pipe network
The Desk-top Supercomputer!
*) CPU mode / DPU mode capability
“Super Pentium” configuration example
TU Kaiserslautern
CPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
CPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
CPU
rDPU
rDPU
rDPU
rDPU
rDPU
rDPU
CPU
© 2006, [email protected]
38
http://hartenstein.de
e. g.: ~ 8 x 8 rDPA: all feasible under 500 MHz
World TV & game console & multimedia center
• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise reduction and artifact removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness enhancement
• Shadow enhancement
• Differentiation
• Programmable de-interlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes
[Figure: SMeXPP rDPA connected to games, camera, SD/MMC cards, radio interface, videos, music, LCD display, baseband processor and audio interface]
http://pactcorp.com
Dual Paradigm Application Development
high level language
software/configware co-compiler
software code: instruction-stream-based CPU
configware code: data-stream-based reconfigurable accelerator
hardwired accelerator
Software / Configware Co-Compilation
C language source
Partitioner (supporting different platforms; input: Resource Parameters)
SW compiler: code for the CPU
CW compiler: code for the rDPU array
Placement & Routing (Move the Locality of Operation)
[Figure: co-compilation flow targeting a CPU and a 4 x 4 array of rDPUs]
Juergen Becker’s CoDe-X, 1996
Bringing together data and processor
Place the location of execution into the data pipe by Configware
Move the stool
Conclusions (1): Hurdles
Obstacles are:
unbelievably disastrous tools market:
enabling technologies available, partly decades old, but not used
unbelievably ignorant curricula:
fragmentation into application-domain-specific cultures and trick boxes
transdisciplinary models neither available nor taught, in CS or elsewhere
Conclusions (2): Future Work
The monopoly of the von-Neumann-based mind set in CS education:
heavily stalls progress in R&D, not only in HPC
causes high cost in R&D, not only in supercomputing
CS graduates are not qualified for our job market
The von-Neumann-only mind set in CS urgently needs to go, so that the dual-paradigm common model can be adopted
CS disciplines must recognize and accept their strategic role and their responsibility toward all their application disciplines: embedded and scientific computing.
Conclusions (3): Chances
New horizons:
chances are brilliant
thank you
Backup:
Co-Compiler Enabling Technology
is available from academia
only a small team needed for commercial re-implementation
on the road map to the Personal Supercomputer
Compilation: Software vs. Configware
Software Engineering:
source program (C, FORTRAN, MATLAB)
software compiler
software code
Configware Engineering:
source „program“
configware compiler: mapper (placement & routing) and data scheduler
configware code and flowware code
Nick Tredennick’s Paradigm Shifts explain the differences
Software Engineering (CPU):
resources: fixed
algorithm: variable
1 programming source needed: software
Configware Engineering:
resources: variable
algorithm: variable
2 programming sources needed: configware and flowware
Co-Compilation
C, FORTRAN, MATLAB
Software / Configware Co-Compiler:
automatic SW / CW partitioner
software compiler
configware compiler: mapper and data scheduler
software code, configware code and flowware code
Co-Compiler for Hardwired Kress/Kung Machine
[e. g. Brodersen]
source
Software / Flowware Co-Compiler:
automatic SW / CW partitioner
software compiler
flowware compiler: data scheduler
software code and flowware code
The first archetype machine model
Software Industry
Software Industry’s Secret of Success: simple basic Machine Paradigm
procedural personalization: compile or assemble
personalization: RAM-based
instruction-stream-based mind set: “von Neumann” (main frame, CPU)
The 2nd archetype machine model
Configware Industry
Configware Industry’s Secret of Success: simple basic Machine Paradigm
structural personalization: compile
personalization: RAM-based
data-stream-based mind set: “Kress-Kung” (reconfigurable accelerator)
„Saves more than $10,000 in electricity bills per
year (7¢ / kWh) - .... per 64-processor 19" rack“
[Herb Riley, R. Associates]
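What this quote implies in energy terms can be worked out with a few lines of arithmetic. The $10,000 per year, the 7¢ / kWh and the 64 processors come from the quote; continuous operation around the clock is an assumption made here, and the resulting figures are only what the quoted numbers imply, not measured values.

```python
savings_usd_per_year = 10_000     # from the quote ("more than $10,000")
price_usd_per_kwh = 0.07          # from the quote (7 cents per kWh)
processors_per_rack = 64          # from the quote
hours_per_year = 24 * 365         # assumption: the rack runs around the clock

energy_saved_kwh = savings_usd_per_year / price_usd_per_kwh
avg_power_saved_kw = energy_saved_kwh / hours_per_year
per_processor_w = avg_power_saved_kw * 1000 / processors_per_rack

print(round(energy_saved_kwh))        # 142857 kWh per year
print(round(avg_power_saved_kw, 1))   # 16.3 kW on average
print(round(per_processor_w))         # 255 W per processor
```

Since the quote says "more than", these values are lower bounds under the stated assumption.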
modern FPGA bestsellers:
The new model is reality:
FPGA fabrics, together with several µprocessors, many memory banks, and other IP cores, on the same COTS microchip
DSP platform FPGA [courtesy Xilinx Corp.]
500 MHz PowerPC™ Processors (680 DMIPS) with Auxiliary Processor Unit
500 MHz multi-port distributed 10 Mb SRAM
500 MHz DCM Digital Clock Management
500 MHz Flexible Soft Logic Architecture, 200K Logic Cells
0.6-11.1 Gbps Serial Transceivers
1 Gbps Differential I/O
500 MHz Programmable DSP Execution Units