
SBCCI 2006 - the 19th Symposium on
Integrated Circuits and System Design
Ouro Preto, Minas Gerais, Aug 28 - Sept 1, 2006
SBMICRO 2006 - the 21st
Symposium on Microelectronics
Technology and Devices
Reiner Hartenstein
TU Kaiserslautern
Re-definition of Low Power Design
for HPC: a Paradigm Shift
>> Outline <<
• Preface
• The Supercomputing Crisis
• The Solution ignored for decades
• Fine-grained vs. coarse-grained
• The wrong Road Map for CS Curricula
• Conclusions
http://www.uni-kl.de
© 2006, [email protected]
http://hartenstein.de
Pervasiveness of FPGA application
More recently, FPGAs as accelerators have also entered
every area of scientific computing
Compute-intensive: my talk does not really
cover performance of bulk storage, discs, etc.
It highlights the supercomputing paradigm trap
and a fully ignored early solution
It illustrates why behind the FPGA success there
is a hidden paradigm shift
What we learn for Low Power Design
Up to 4 orders of magnitude
For many published speed-up factors
obtained from software-to-FPGA migration
see Jürgen Becker's part of Monday's tutorial
But before FPGAs came up, the DPLA*
(a programmable PLA) was successful
inside the MoM computer architecture
*) designed at Kaiserslautern and fabricated via the
German multi university E.I.S. project infrastructure
1986: Xputer Lab at Kaiserslautern: MoM I and II
The Reconfigurable Computing paradox
the effective integration density of FPGAs
is behind the Gordon Moore curve by
more than 4 orders of magnitude
• wiring overhead
• reconfigurability overhead
• routing congestion
• low clock frequency
• power-hungry
• getting worse for larger FPGAs
An Example: FPGAs in Oil and Gas .... (1)
[Herb Riley, R. Associates]
“Application migration [from supercomputer]
has resulted in a 17-to-1 increase in performance”
For this example speed-up is not my key issue
(Jürgen Becker's tutorial showed much higher
speed-ups, going up to a factor of 6000)
For this oil and gas example a side effect is
much more interesting than the speed-up
An Example: FPGAs in Oil and Gas .... (2)
[Herb Riley, R. Associates]
“Application migration [from supercomputer]
has resulted in a 17-to-1 increase in performance”
Saves more than $10,000 in electricity bills
per year (7¢ / kWh) - .... per 64-processor 19" rack
Did you know …
… that 25% of Amsterdam's electric energy
consumption goes into server farms?
… that a quarter square-kilometer of office floor space
within New York City is occupied by server farms?
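To get a feeling for the rack-level savings quoted on this slide, the dollar figure can be converted into energy and average power. The sketch below uses only the two numbers from the slide ($10,000 per year at 7¢ / kWh, per 64-processor rack); the assumption of round-the-clock operation (8760 hours per year) and all variable names are ours, so the result is a rough lower bound, not a measured value.

```python
# Figures from the slide: more than $10,000 saved per year at 7 cents/kWh,
# per 64-processor 19" rack. Continuous operation is an assumption.
SAVINGS_USD_PER_YEAR = 10_000
PRICE_USD_PER_KWH = 0.07
PROCESSORS_PER_RACK = 64
HOURS_PER_YEAR = 365 * 24  # assumption: the rack runs all year

# Energy that corresponds to the saved money.
energy_kwh_per_year = SAVINGS_USD_PER_YEAR / PRICE_USD_PER_KWH

# Average electrical power that is no longer drawn, per rack and per processor.
average_power_kw = energy_kwh_per_year / HOURS_PER_YEAR
average_power_w_per_processor = average_power_kw * 1000 / PROCESSORS_PER_RACK

print(f"energy saved per rack:      {energy_kwh_per_year:,.0f} kWh per year")
print(f"average power saved (rack): {average_power_kw:.1f} kW")
print(f"average power saved (CPU):  {average_power_w_per_processor:.0f} W")
```

Under these assumptions the saving corresponds to roughly 143,000 kWh per year, i.e. an average of about 16 kW per rack.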
Oil and Gas as a strategic issue
Low power design: not only to keep the chips cool
Do you know the size of Google’s electricity bill?
It should be investigated how far the migration
achievements obtained for computationally intensive
applications can also be utilized for servers
Recently the US Senate ordered a study
on the energy consumption of servers
FPGA use: A new direction in low power Design
as a panelist at:
http://www.patmos-conf.org
Father of ISLPED
Sept. 13-15, 2006, Montpellier, France
2006 International Symposium on Low Power Electronics and Design (ISLPED),
October 4-6, 2006, Rottach-Egern, Tegernsee, Germany
http://www.islped.org/
Reconfigurability per se is not the key
It’s the paradigm coming along with it
Note: no instruction fetch at run time !
Data streams instead of instruction streams
Enabling technology for data sequencers
brings further performance improvements
A non-reconfigurable example is the BEE
project (Bob Brodersen et al., UC Berkeley)
                                 W/Gflops   $/Gflops
Earth Simulator                  128        8000
MDGrape-3 (non-reconfigurable)   0.2        15
factor                           640        533
Petaflops by GRAPE: massive pipelining and on-chip distributed memory
GRAvity PipE: special purpose computer for
astrophysical N-body simulations, and,
Molecular Dynamics Simulations
MDGRAPE-3 (aka Protein Explorer):
Petaflops-GRAPE [Univ. of Tokyo & Genomic
Sciences Center at RIKEN institute]
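The "factor" row of the comparison can be re-derived from the two rows above it. The following sketch simply divides the Earth Simulator figures by the MDGrape-3 figures as quoted on the slide; the dictionary layout and function name are illustrative choices, not part of the original material.

```python
# Values as quoted on the slide (W/Gflops and $/Gflops).
systems = {
    "Earth Simulator": {"W/Gflops": 128.0, "$/Gflops": 8000.0},
    "MDGrape-3": {"W/Gflops": 0.2, "$/Gflops": 15.0},
}


def improvement_factor(metric):
    """Ratio Earth Simulator / MDGrape-3 for one metric (hypothetical helper)."""
    return systems["Earth Simulator"][metric] / systems["MDGrape-3"][metric]


for metric in ("W/Gflops", "$/Gflops"):
    print(f"{metric}: factor {improvement_factor(metric):.0f}")
```

Rounded to integers, the ratios agree with the 640 and 533 given in the table.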
Explanation of the RC paradox
Any technology providing an improvement by a factor of 10 or more
over an established one can be
expected to become disruptive [Andy Grove].
The analysis of the supercomputing crisis
explains why the “bad” FPGAs are so disruptive
Going toward “connected thinking”
The heyday of reductionism has passed.
[pwc.com]
Impenetrable obstacles have been encountered
which cannot be solved by the classical simple
reductionist approach.
This is the reason for the growing worldwide
significance of transdisciplinary notions
We need coherence instead of
fragmentation into specialists’ niche areas
This is heralding a new era
The basic model paradigm trap
frustrates interdisciplinary education efforts
fragmentation in CS even between subdisciplines
High performance computing
stalled for decades by the
von Neumann paradigm trap:
the wrong road map.
The right road map: held back by
another trap for decades !
stolen from Bob Colwell
Transdisciplinary Education?
Computer Science not prepared
Lacking intradisciplinary cohesion
between the mind sets of:
• Theoreticians (Math background)
• Hardware People
• Computer Architects
• Embedded Syst. Designers
• Software People (Application Development)
for decades: the Hardware / Software chasm
turns into: the Configware / Software chasm
migration of the lemmings
[David Padua, John Hennessy, et al.]
Flagship conference series: IEEE ISCA
Jean-Loup Baer
The Dead Supercomputer Society
Research 1985 – 1995 [Gordon Bell, keynote ISCA 2000]
• ACRI
• Alliant
• American Supercomputer
• Ametek
• Applied Dynamics
• Astronautics
• BBN
• CDC
• Convex
• Cray Computer
• Cray Research
• Culler-Harris
• Culler Scientific
• Cydrome
• Dana/Ardent/Stellar/Stardent
• DAPP
• Denelcor
• Elexsi
• ETA Systems
• Evans and Sutherland Computer
• Floating Point Systems
• Galaxy YH-1
• Goodyear Aerospace MPP
• Gould NPL
• Guiltech
• ICL
• Intel Scientific Computers
• International Parallel Machines
• Kendall Square Research
• Key Computer Laboratories
• MasPar
• Meiko
• Multiflow
• Myrias
• Numerix
• Prisma
• Tera
• Thinking Machines
• Saxpy
• Scientific Computer Systems (SCS)
• Soviet Supercomputers
• Supertek
• Supercomputer Systems
• Suprenum
• Vitesse Electronics
Monstrous Steam Engines of Computing
Crossbar weight: 220 t, 3000 km of thick cable,
5120 Processors, 5000 pins each
ES 20: TFLOPS
peak or sustained?
Illustrating the von Neumann paradigm trap
the watering can model [Hartenstein]
The instruction-stream-based approach
many watering cans
The data-stream-based approach has no von Neumann bottleneck
[Figure label: von Neumann bottleneck]
The Memory Wall (1)
Moving data to
the processor:
Data meeting the Processing Unit (PU)
We have 2 choices:
by Software: routing the data by memory-cycle-hungry instruction streams
by Configware: placement of the execution locality
(optimize a pipe network: place PU in data stream)
The Memory Wall (2)
Key problem is the inefficiency
and complexity of moving data,
not processor performance.
Most important goal is the
minimization of the number
of main memory cycles.
Supercomputing urgently needs a
fundamentally different approach
toward interconnect efficiency.
Tear down this Wall !
The right road map to HPC:
there ignored for decades
massively reducing memory cycles
[Figure: DPA in time/space notation, with input data streams entering and output data streams leaving]
DPU operation is transport-triggered
no instruction streams
no message passing
nor thru common memory
where were the supercomputing people ?
The Systolic Array
(H. T. Kung paradigm)
nice time/space notation - defines:
... which data item, at which time, at which port
[Figure: (pipe network) DPA* with input data streams and output data streams; axes: time and port #]
*) DataPath Array (array of DPUs)
DataPath Unit has no program counter! it’s no CPU!
CS mathematicians' hobby, early 80s
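The time/space notation (which data item arrives at which time at which port) can be made concrete with a small simulation. The following is a minimal sketch of a linear systolic array computing a matrix-vector product; the function name, the skewed schedule and the example data are our own illustrative assumptions, not taken from the slides. Each DPU only reacts to the data item that reaches it in a given clock tick; none of them has a program counter.

```python
def systolic_matvec(matrix, vector):
    """Simulate a linear array of DPUs computing y = matrix * vector.

    DPU i keeps one accumulator. The vector items travel through the array,
    one DPU per clock tick; matrix item a[i][j] is scheduled to meet x[j]
    at DPU i at time i + j (the usual skewed systolic schedule).
    """
    n_rows, n_cols = len(matrix), len(vector)
    accumulators = [0] * n_rows   # one accumulator per DPU
    pipe = [None] * n_rows        # x item currently sitting at each DPU

    for t in range(n_rows + n_cols - 1):          # clock ticks
        # Shift the x stream by one DPU and feed the next item at port 0.
        next_item = vector[t] if t < n_cols else None
        pipe = [next_item] + pipe[:-1]
        for i in range(n_rows):
            if pipe[i] is not None:               # DPU fires on data arrival
                j = t - i                         # index of the x item at DPU i
                accumulators[i] += matrix[i][j] * pipe[i]
    return accumulators


print(systolic_matvec([[1, 2], [3, 4], [5, 6]], [10, 1]))
```

The loop over `t` plays the role of the time axis and the index `i` the role of the port number in the slide's notation.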
Terminology
term    program counter   execution triggered by   paradigm
CPU     yes               instruction fetch        instruction-stream-based
DPU**   no                data arrival*            datastream-based
*) “transport-triggered”
**) does not have a program counter
The new paradigm: how the data are traveling
better not by instruction execution
[Jack Lipovski, EUROMICRO, Nice, 1975]
An old hat: transport-triggered + instruction-driven
vN Move Processor: instruction-driven
DPU pipeline, or chaining (DPU, DPU, DPU): super systolic array
P&R: move locality of operation, not data !
Mathematicians X-ing
Systolic
Synthesis
Mathematicians like the
beauty and elegance
of Systolic Arrays.
Due to a lack of intradisciplinary view, their
efforts yielded poor
synthesis algorithms.
Reiner Hartenstein
Synthesis Method?
of course, algebraic !
Algebraic means linear projection, restricted to
uniform arrays, only with linear pipes
useful only for applications with
strictly regular data dependencies
Mathematicians caught by their own paradigm trap
for more than a decade
rDPA:
Generalization* by a transdisciplinary hardware guy:
Rainer Kress discarded their algebraic synthesis
methods and replaced them with simulated annealing (1995).
*) super-systolic
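To illustrate what "synthesis by simulated annealing" means in this context, here is a minimal, generic placement sketch. It is not Rainer Kress's actual mapper; the cost function (total Manhattan wire length), the cooling schedule, the swap move and every name below are assumptions chosen for brevity. Operators of a small pipeline are placed onto a grid of rDPUs so that connected operators end up close to each other, with no regularity requirement on the data dependencies.

```python
import math
import random


def wirelength(positions, nets):
    """Sum of Manhattan distances over all producer -> consumer connections."""
    return sum(abs(positions[a][0] - positions[b][0]) +
               abs(positions[a][1] - positions[b][1]) for a, b in nets)


def anneal_placement(ops, nets, rows, cols, steps=4000, t0=4.0, seed=0):
    """Place operators on a rows x cols array by simulated annealing (sketch)."""
    if len(ops) > rows * cols:
        raise ValueError("more operators than rDPUs in the array")
    rng = random.Random(seed)
    # One slot per rDPU; None marks an unused rDPU.
    slots = list(ops) + [None] * (rows * cols - len(ops))
    rng.shuffle(slots)

    def positions():
        return {op: divmod(i, cols) for i, op in enumerate(slots) if op is not None}

    cost = initial_cost = wirelength(positions(), nets)
    best, best_cost = positions(), cost
    for step in range(steps):
        temperature = t0 * (1.0 - step / steps) + 1e-9   # linear cooling
        i, j = rng.randrange(len(slots)), rng.randrange(len(slots))
        slots[i], slots[j] = slots[j], slots[i]          # trial move: swap two slots
        new_cost = wirelength(positions(), nets)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temperature):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = positions(), cost
        else:
            slots[i], slots[j] = slots[j], slots[i]      # reject: undo the swap
    return best, best_cost, initial_cost


ops = ["in", "mul", "add", "shift", "sub", "out"]        # hypothetical operator chain
nets = list(zip(ops, ops[1:]))
placement, final_cost, initial_cost = anneal_placement(ops, nets, rows=3, cols=3)
print(f"wire length: {initial_cost} -> {final_cost}")
```

Because the best placement seen so far is tracked separately, the returned cost can never be worse than the random initial placement.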
Generating the Data Streams
Who generates the data streams ?
Mathematicians: it's not our job
(it's not algebraic)
[Figure: DPA with input data streams and output data streams]
No machine paradigm
Only one half of the machine
Defined only the data path, however,
without the sequencing resources
Mathematicians considered that providing
the enabling technology is somebody else‘s job
Disclaimer
But there are mathematicians
who are not reductionists,
e. g., fully spanning the
transdisciplinary cohesion from
Term Rewriting Systems
to dynamically reconfigurable
system design & synthesis
Data stream generators
use data counters, no program counter
[Figure: reconfigurable (pipe network) rDPA surrounded by ASM blocks connected to its input and output data streams]
ASM: AutoSequencing Memory (GAG, RAM, data counter)
implemented by distributed on-chip memory
50 & more on-chip ASM are feasible
32 ports, or n x 32 ports
non-von-Neumann machine paradigm
(anti-von-Neumann machine paradigm)
ASM: AutoSequencing Memory (GAG, RAM, data counter)
Generalization of the DMA
Data Counter instead of Program Counter
GAG & enabling technology: published 1989 [by TU-KL],
patented by TI** 1995
Survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik]
Storage Scheme optimization methodology, etc.
*) IMEC & TU-KL
**) -
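The idea of a data counter replacing the program counter can be sketched in a few lines. The example below is an illustration under our own assumptions (a 2-D window scan over a row-major memory, with hypothetical function names); it does not reproduce the published GAG design. The address sequence is produced purely by counters, and the memory then emits a data stream without any instruction being fetched.

```python
def scan_addresses(base, row_stride, width, height):
    """Yield the addresses of a width x height window in row-major memory.

    The two loop variables act as data counters; there is no instruction
    stream involved in producing the address sequence.
    """
    for y in range(height):          # outer data counter
        for x in range(width):       # inner data counter
            yield base + y * row_stride + x


def auto_sequencing_memory(ram, addresses):
    """Turn an address sequence into a data stream read from `ram`."""
    for address in addresses:
        yield ram[address]


# Hypothetical 4 x 4 word memory holding the values 100..115.
ram = list(range(100, 116))
window = list(auto_sequencing_memory(
    ram, scan_addresses(base=5, row_stride=4, width=2, height=2)))
print(window)
```

Changing the scan pattern only changes the counter parameters, which is the sense in which such a unit generalizes a DMA controller.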
Compilation: Software vs. Configware
Software Engineering:
source program -> software compiler -> software code (instruction streams)
Configware Engineering:
source „program“ (C, FORTRAN, MATLAB, …) -> configware compiler (mapper: placement & routing; data scheduler)
-> configware code (configuration) and flowware code (data streams)
Educational Deficits
Educational deficits have stalled Reconfigurable
Computing (RC) as well as classical supercomputing
Transdisciplinary fragmentation: each application
domain uses its own trick boxes
Too many sophisticated very clever architectures
We need a fundamental model with a methodology
which all application domains have in common
Transdisciplinary education & basic research needed
Coarse-grained vs. fine-grained
device          granularity                    path width                eff’ve density   flexibility
FPGA            fine-grained                   ~ 1 bit                   very low         general purpose
DPA             coarse-grained                 multi bit: e.g. 32 bits   very high        specialized
rDPA            coarse-grained                 multi bit: e.g. 32 bits   very high        domain-specific
platform FPGA   fine-grained & embedded hdw.   mixed                     high             domain-specific
Why coarse grain
instead of rLB (~1 bit wide)
use rDPU (e. g. 32 bits wide):
reconfigurable Data Path Unit (e. g. rALU)
instead of FPGA use rDPA
[Figure: array of 16 rDPUs]
much more area-efficient
much less reconfigurability overhead
much more MOPS/milliWatt
mind set close to classical computing background
Coarse grain is about computing, not logic
Example: mapping onto rDPA by DPSS: based on simulated annealing
SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]
array size: 10 x 16 = 160 rDPUs
no CPU
rDPU: reconfigurable function block, e. g. 32 bits wide
Legend: rDPU not used; used for routing only (rout thru only); operator and routing; port location marker; backbus connect
(r)DPA
commercial rDPA example:
PACT XPP - XPU128
• Full 32 or 24 Bit Design working silicon
• 2 Configuration Hierarchies
• Evaluation Board available, and
• XDS Development Tool with Simulator
[Figure: XPP128 rDPA; labels: ALU, Ctrl, CFG, rDPU, PAE core; buses not shown]
© PACT AG, http://pactcorp.com
e. g.: array w. 56 rDPUs: running under 500 MHz
World TV & game console & multi media center
• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise Reduction and Artifact Removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable Displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness Enhancement
• Shadow Enhancement
• Differentiation
• Programmable de-interlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes
[Figure labels: Games, Camera, SD/MMC Cards, Radio-Interface, Videos, Music, Audio-Interface, LCD DISPLAY, Baseband-Processor, SMeXPP rDPA]
http://pactcorp.com
Joint Task Force for Computing Curricula 2004
fully ignores
Reconfigurable Computing
Curricula ?
FPGA & synonyms: 0 hits
(Google: 10 million hits)
not even here
Curriculum Recommendations, v. 2005
Upon my complaints* the only change: including
at end of last paragraph of the survey volume:
"programmable hardware (including
FPGAs, PGAs, PALs, GALs, etc.)."
However, no structural changes at all
v. 2005 intended to be the final version (?)
torpedoing the transdisciplinary
responsibility of CS curricula
This is criminal !
Peter Denning …
*) no reply
with ACM and IEEE-CS: not in good hands
works towards the development of principles and ideas
for multidisciplinary modes of research and education.
We need SDPS to identify intra-disciplinary
communication gaps in CS
to develop a roadmap for CS to assume
intradisciplinary responsibility for education
SDPS, the first transdisciplinary society
The transdisciplinary genie is out of the bottle.
There is no turning back from interdisciplinary
cohesion and integrative attempts to solve the
complex problems of mankind in this century.
The era of individual disciplinary successes
and accumulating disciplinary silos of
locally functional knowledge has ended
with the 20th century.
IDPT - Call for Papers
http://hartenstein.de/IDPT2007/
IDPT 2007
IDPT 2006 Speakers:
8 University Presidents
( 1 founding president)
10 Deans
(1 founding dean)
1 Nobel Prize Laureate
6 Directors
and many others …
Conclusion
We need a Re-definition of Low Power Design
not only for microprocessors and
embedded systems,
but also for HPC and supercomputing:
as a Paradigm Shift and a strategic issue
thank you
END
Backup for
Discussion:
Here is the common model
it’s not von Neumann
[Figure: instruction-stream-based CPU (software code) next to datastream-based accelerators (configware code): reconfigurable accelerator and hardwired accelerator]
most accumulated MIPS have been migrated here
mainly just for running legacy code etc.
the tail is wagging the dog
Dual Paradigm Application Development
high level language: C language source
-> Partitioner (software/configware co-compiler: Juergen Becker’s CoDe-X, 1996)
   SW compiler -> software code -> CPU (instruction-stream-based)
   CW compiler -> configware code -> reconfigurable accelerator, an rDPU array (datastream-based)
hardwired accelerator
For Transdisciplinary CS Education
The von-Neumann-only mind set (procedural-only) is obsolete
We need a curricular dual-paradigm approach:
structural (datastream-based) and procedural (instruction-stream-based)
The supercomputing paradigm trap
this did not prevent supercomputing from
following the wrong road map for decades,
imprisoned by the von Neumann paradigm trap
No technology transfer from Mathematics:
caught by the algebraic paradigm trap
(systolic array scene)
The language and tool disaster
End of April a DARPA brainstorming conference
Software people do not speak VHDL
Hardware people do not speak MPI
Bad quality of the application development tools
A poll at FCCM’98 revealed that
86% of hardware designers hate their tools
Escaping the paradigm trap
The underground success story of FPGAs
The fastest growing segment
of the semiconductor market
Massive speed-up
Slashing the electricity bill
However, this is not supported
by our education systems
The end of Moore’s Law
growth in complexity and clock frequency of single-core
microprocessors is coming to an end
Multi-core microprocessor chips emerging:
32 cores on a chip from AMD by 2010.
Just putting more CPUs on the chip is not the way
to go for very high performance.
We have learnt this lesson from the supercomputing
community, which paid an extremely high price for
monstrous installations by following the wrong
road map for decades.
Such fundamental bottlenecks in computer science
will necessitate new breakthroughs
Algorithms: fundamental misconception
Instead of hitting physical limits, we
found that further progress is limited
by a fundamental misconception in the
theory of algorithmic complexity.
It is not processing data that is costly, but moving data.
We have to rethink the basic
assumptions behind computing.
Taxonomy of Algorithm Migration (1)
(Instruction-stream-based algorithm taxonomy:
partially existing, not really systematic)
Algorithms migrated to time-space domain
(for RC): no taxonomy exists yet
Computationally intensive applications are
the best candidates for migration to FPGA
A few algorithms (e. g. Turbocode or Viterbi)
require a massive amount of interconnect
bulk data bases might be candidates for FPGA usage
to avoid memory cycles for address computation
Steadily arriving and departing data streams are the best candidates
Taxonomy of Algorithm Migration (2)
Migration efficiency (reducing memory cycles):
Servers: to be investigated. What is certain:
• loop transformations: efficient, deterministic
• caches: nondeterministic and energy guzzlers
• much less local memory needed
• secondary data memory: distributed on-chip
memory architectures highly promising
• address computations: efficient migration
configware solution: computing in space
for demo: a tiny section of the pipe network
inter-rDPU-communication: no memory cycles needed
[Figure: tiny section of the rDPU pipe network; one rDPU is configured as an adder (+) with output S]
Compare it to software solution on CPU
S = R + (if C then A else B endif);
[Figure: pipe network section with inputs R, B, A, condition C = 1, adder (+), output S; clock: 200]
on a very simple CPU (columns: memory cycles, nanoseconds):
if C then read A: read instruction; instruction decoding; read operand*; operate & register transfers
if not C then read B: read instruction; instruction decoding
add & store: read instruction; instruction decoding; operate & register transfers; store result
total
section of a major pipe network on rDPU
hypothetical branching example to illustrate
software-to-configware migration
S = R + (if C then A else B endif);
[Figure: pipe network section with inputs R, B, A, condition C = 1, adder (+), output S; clock 200 MHz (5 nanosec)]
simple conservative CPU example (C = 1):
                                memory cycles   nanoseconds
if C then read A
  read instruction              1               100
  instruction decoding
  read operand*                 1               100
  operate & reg. transfers
if not C then read B
  read instruction              1               100
  instruction decoding
add & store
  read instruction              1               100
  instruction decoding
  operate & reg. transfers
  store result                  1               100
total                           5               500
*) if no intermediate storage in register file
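The comparison on this slide can be replayed as a small calculation. The sketch below takes the per-step memory cycles and the 100 ns memory cycle time from the CPU table, and the 5 ns clock period from the 200 MHz figure. That the pipe network delivers one result per clock tick once it is filled is our assumption, and the function and constant names are hypothetical. The function also shows that the "decision" is simply a multiplexer placed in the data stream.

```python
MEMORY_CYCLE_NS = 100   # from the slide: each memory cycle costs 100 ns
CLOCK_PERIOD_NS = 5     # from the slide: 200 MHz clock

# (step, memory cycles) for the CPU version with C = 1, as listed on the slide.
cpu_steps = [
    ("read instruction", 1), ("instruction decoding", 0),
    ("read operand", 1), ("operate & register transfers", 0),
    ("read instruction", 1), ("instruction decoding", 0),
    ("read instruction", 1), ("instruction decoding", 0),
    ("operate & register transfers", 0), ("store result", 1),
]
cpu_memory_cycles = sum(cycles for _, cycles in cpu_steps)
cpu_time_ns = cpu_memory_cycles * MEMORY_CYCLE_NS


def pipe_network(r, a, b, c):
    """S = R + (if C then A else B): a multiplexer feeding an adder."""
    selected = a if c else b    # multiplexer rDPU: the decision, made in space
    return r + selected         # adder rDPU


# Assumption: one result per clock tick once the pipeline is filled.
ratio = cpu_time_ns / CLOCK_PERIOD_NS
print(f"CPU: {cpu_memory_cycles} memory cycles, {cpu_time_ns} ns")
print(f"pipe network: {CLOCK_PERIOD_NS} ns per result, ratio {ratio:.0f}")
print(pipe_network(r=10, a=1, b=2, c=True))
```

Under that assumption the ratio is 100, and no memory cycle is spent on fetching instructions or on passing data between rDPUs.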
The wrong mind set ....
S = R + (if C then A else B endif);
section of a very large pipe network:
[Figure: inputs R, B, A, condition C = 1, adder (+)]
„but you can‘t implement decisions!“
not knowing this solution:
symptom of the hardware / software chasm
and the configware / software chasm
Co-Compiler Enabling Technology
is available from academia
only a small team needed for
commercial re-implementation
on the road map to the
Personal Supercomputer
Flowware Languages vs. Software
Compilation: Software vs. Configware
Software Engineering:
source program -> software compiler -> software code
Configware Engineering:
source „program“ (C, FORTRAN, MATLAB) -> configware compiler (mapper: placement & routing; data scheduler)
-> configware code and flowware code
Nick Tredennick’s Paradigm Shifts
explain the differences
Software Engineering (CPU):
resources: fixed
algorithm: variable
1 programming source needed: software
Configware Engineering:
resources: variable
algorithm: variable
2 programming sources needed: configware, flowware
Co-Compilation
C, FORTRAN, MATLAB
-> Software / Configware Co-Compiler:
   automatic SW / CW partitioner
   software compiler -> software code
   configware compiler (mapper, data scheduler) -> configware code and flowware code
Co-Compiler for Hardwired Kress/Kung Machine
[e. g. Brodersen]
source
-> Software / Flowware Co-Compiler:
   automatic SW / CW partitioner
   software compiler -> software code
   flowware compiler (data scheduler) -> flowware code
The Pervasiveness of
Reconfigurable Computing (RC)
FPGAs are used everywhere
[Table: # of hits by Google for “FPGA and ….” queries, Nov. 2005; the listed counts range from 113,000 to 1,620,000]
some published speed-up factors
The RC paradox: these speed-ups are obtained although the effective
integration density of FPGAs is by 4 orders of magnitude
behind the Moore curve (wiring overhead, reconfigurability overhead, routing congestion)
[Figure: relative performance vs. year (1980 to 2010), with the 8080 and the Pentium 4 as reference points; published speed-up factors between 20 and 6000 for DSP and wireless, image processing, pattern matching, real-time face detection, Reed-Solomon decoding, crypto, multimedia, video-rate stereo vision, MAC, pattern recognition, SPIHT wavelet-based image compression, bioinformatics, FFT, protein identification, BLAST, Viterbi decoding, Smith-Waterman pattern matching, molecular dynamics simulation, and GRAPE (astrophysics)]
Transdisciplinary Research and Education
• working towards the development of principles and ideas for
multidisciplinary modes of research and education.
• There are challenges that cannot be overcome using
methods within a single discipline [A. M. Madni, Ph. C-Y Sheu]
• The transdisciplinary way of acquiring knowledge means that
education, research, development, production, and training
are intertwined in such a way that we obtain a better
picture and a higher level of abstraction.
• This allows us to overcome the shortcomings of the
classical, Cartesian-mechanistic, reductionist foundations,
and methods of traditional sciences and engineering. [A. Ertas,
M. M. Tanik]
how science will revolutionize the 21st century
The heyday of reductionism has passed.
This is the reason of the growing worldwide
significance of transdisciplinary notions
Impenetrable obstacles have been encountered
which cannot be solved by the classical simple
reductionist approach.
This is heralding a new era
Holistic Thinking
work towards the development of principles and ideas
for multidisciplinary modes of research and education.
provides us the necessary tools and methods to
maintain intellectual control over large projects
overcome the shortcomings of the classical, Cartesian-mechanistic,
reductionist foundations and methods
Herbert A. Simon (Nobel Laureate): The Sciences
of the Artificial; 3rd Edition
holistic thinking vs. mechanistic thinking, disciplinary vs.
transdisciplinary thinking, reductionism vs. holism
Scientific Revolutions
Thomas S. Kuhn: The Structure
of Scientific Revolutions;
University of Chicago Press, 1962
3rd edition: 1996
http://www.des.emory.edu/mfp/Kuhn.html
Outline and Study Guide, prepared Aug. 2004
by Professor Frank Pajares, Emory University
More Books
Michio Kaku: Visions: How Science
Will Revolutionize the 21st Century;
ANCHOR, September 1998
Everett M. Rogers, Nancy Singer
Olaguera: Diffusion of
Innovations; Fifth Edition, Simon
& Schuster, August 2003
Conclusions
excellent results proven for
computationally intensive applications
highly promising for servers
improvements likely for bulk
data & storage applications
tool and language scenario needs an
urgent transdisciplinary clean-up