System power, size and speed trade-offs in a platform-independent precompiler context
Francky Catthoor, Diederik Verkest
+ EMSYS/SEMP/ADT researchers
IMEC, Leuven, Belgium
MATADOR
© imec 2000
Francky Catthoor, DESICS
IMEC
ACROPOLIS
Illustration of demanding
requirements
MPEG-4 : multimedia
Huge requirements:
> 2 GOP/s
> 6 GB/s
> 10 MB storage
Software specification:
more than 200 000 lines C
hundreds of files
written by approx. 80 teams
Illustration of demanding
requirements
Today's implementations:
small images (QCIF: 176x144)
decoding only
not real-time
several W
Wanted features:
large images (TV)
encoding and decoding
real-time
100 mW (mobile)
Goal of our research
A methodology to map dynamic and concurrent real-time applications onto an embedded multi-processor platform
Why are Applications becoming
more dynamic and concurrent?
[Diagram: JPEG with a static task set (T1, T2) vs. MPEG-4 with dynamically created tasks (T1, T1', T3, T4)]
The workload decreases, but the tasks are created dynamically and their size is data-dependent.
Very dynamic behaviour, so a worst-case realisation is too costly
[Chart: frame period (ms) and projected surface (# pixels/200), 0-1000, vs. frame number, 0-600: (a) projected surface, (b) measured period, (c) predicted period]
System design issues in
IT-Application domain
Embedded system => cost
Processes:
- dynamic and concurrent processes
- global/local control
- non-deterministic events
Complex data sets:
- large and irregular dynamically allocated data
- huge numbers of memory accesses (e.g. 200 accesses in 53 cycles)
Stringent real-time constraints (e.g. 622 Mb/s data in and out, FIFO, packet record, routing record, routing reply)
Application domains: network layer protocols (ATM, IP); dynamic multimedia algorithms (MPEG-4/7/21); wireless/wired terminals (Internet, WLAN)
Divide-and-conquer approach of
today leads to local solutions
MPEG-4: the multimedia spec is too huge to handle as one task, so it is broken up into many interacting tasks.
[Cartoon: a designer optimises TASK1 in isolation on its own cost(P) vs. execution-time curve: "With this code my boss has to give me a raise!!!"]
Global picture: ad hoc
This didn't look so good after all???
[Diagram: TASK1, TASK2 and TASK3, each locally optimised on Processor 1 with working points t1, t2, t3; globally the deadline constraint t1+t2+t3 < T must hold]
Global trade-offs with cost-performance curves
Luckily we have the Pareto approach!
[Diagram: each task has a Pareto cost-performance curve; working points t1n, t2n, t3n are selected on the curves of TASK1, TASK2 and TASK3 (on Processor 1) such that t1n+t2n+t3n < T]
Requirements of system level
design approach
[Diagram: algorithms + data structures on one side, architecture on the other: a platform with an ARM core, a DSP, custom logic, a microprocessor with MMU, RAM/ROM blocks and IP blocks (IP1, IP2)]
Global meta flow and models
Algorithm specification (Java, SDL, C++, C, Matlab), modelled as HCDFG+MTG
Algorithm synthesis / data type refinement -> optimized system specification
Inter-task DTSE + task concurrency management -> task-level system architecture (Task1, Task2, Task3)
Intra-task DTSE + data parallelisation management -> array-level system architecture
Processor-level DTSE + instruction-level concurrency management (custom processor synthesis, high-level address optimisation, instruction-set processor mapping) -> processor-level system architecture (Proc1, Proc2, Proc3)
Processor architecture integration + technology integration (reconfigurable and application-specific technology mapping) -> physical system architecture
Orthogonal decision principle
Decision 1: data structure (Array, Linked List, Pointer Array, Binary Tree)
Decision 2: hashing function (yes / no)
Decision 3, Decision 4: further orthogonal decisions
Orthogonal decision:
constraint propagation
[Diagram: decisions 1-4 taken in sequence; branches excluded because of constraint propagation from earlier decisions]
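The pruning effect of constraint propagation can be sketched as a search over orthogonal decisions where an incompatible partial choice cuts off the whole subtree, so later decisions never revisit it. The data-structure and hashing options follow the slide above; the compatibility rule below is purely illustrative:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One orthogonal decision = an independent list of options.
using Choice = std::vector<std::string>;

// Count the combinations that survive constraint propagation: the
// `compatible` predicate is checked on every partial choice, so an
// excluded branch is cut before its subtree is enumerated.
int countValid(const std::vector<Choice>& decisions,
               const std::function<bool(const std::vector<std::string>&)>& compatible) {
    int count = 0;
    std::vector<std::string> partial;
    std::function<void(size_t)> rec = [&](size_t d) {
        if (!compatible(partial)) return;      // propagate: prune the subtree
        if (d == decisions.size()) { ++count; return; }
        for (const auto& opt : decisions[d]) {
            partial.push_back(opt);
            rec(d + 1);
            partial.pop_back();
        }
    };
    rec(0);
    return count;
}
```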
Iteration, constraint
propagation and estimators
[Diagram: the stages from the optimized system specification (inter-task DTSE, task concurrency management) through the task-level, array-level and processor-level system architectures (intra-task DTSE, data parallelisation management, processor-level DTSE, instruction-level concurrency management, processor architecture integration, custom processor synthesis, high-level address optimisation, instruction-set processor mapping); local iteration within each stage, high-level estimation, constraint propagation downwards and architecture constraint propagation back upwards]
Deeply embedded system
[Diagram of a GSM-like terminal: analog front-end (A/D, digital down-conversion), demodulation and sync, equaliser, Viterbi decoder, de-interleaver & decoder, RPE-LTP speech decoder, voice quality enhancement, speech recognition, phone book and keypad interface control, protocol; mapped onto a µP core, a multi-DSP core, dedicated logic, memory/MMU, RAM & ROM, DMA and interfaces; design support: retargetable ASIP compiler, accelerator synthesis, system integration. All of this fits in one, cheap package.]
Deeply embedded system
[Same terminal diagram, now annotated with the mapping targets: µP core, multi-DSP core with memory/MMU, dedicated logic, RAM & ROM, DMA, interfaces, and analog A/D + RF; design support: a system-layer compiler for dynamic + static memory management and address expressions. All of this fits in one, cheap package.]
Platform mapping stages:
4 crucial tasks
Initial specification
1. Memory platform-independent: data transfer and storage exploration (DTSE)
2. Processor platform-independent: task concurrency exploration
Optimized specification
3. Retargetable, memory-platform-dependent DTSE compiler
4. Instruction-level concurrency exploration with traditional/retargetable compilers
Platform architecture design
Platform mapping stages
in DESICS-DIMA
Initial specification
Platform-independent data transfer, storage and concurrency exploration (applied once per new application; manually applied with a trainable step-wise approach)
Optimized specification
Retargetable platform compiler: DTSE + reconfigurable HW+SW co-refinement (DTSE and HW/SW co-refinement with tool support); instruction-level concurrency management with traditional and retargetable compilers
Platform architecture design
[Task graph of an MPEG-4 player: Application / Executive / Service layers; FlexDemux feeding data channels (BIFS, OD, decoder channels), ALManager, decoders with decoding buffers, composition memory, presenter (pre/post writing, pre/post decoding, rendering); critical path through decoding and rendering within a 30 msec period]
C/C++ specification of the dynamic concurrent system
-> extraction of the grey-box model
-> task-level DTSE (memory architecture)
-> concurrency-improving transformations
-> static task scheduling: per-task cost vs. time Pareto curves (task1, task2, task3) under real-time constraints, on the platform
-> run-time scheduler on processor1, processor2, processor3
Grey-box vs. white-box
vs. black-box
120 files of C++ code => 40,000-80,000 lines of code
=> the white-box view drowns in unnecessary complexity and redundancy.
The black-box view does not contain sufficient information for cost-efficient design (not even for real-time).
[Diagram: the grey-box model in between, exposing tasks TaskA-TaskE and their interactions]
Focus of the experiments
at IMEC
[Flow: concurrency extraction/improvement -> concurrent model -> (partial) ordering of objects (driven by real-time issues and DTSE constraints) -> ordered model -> distribution over architecture components -> assigned model -> interface refinement; with an abstract/analyze/improve loop around each step]
The 2-processor platform
(schedule+assign)
[Diagram: tasks Task1 ... Taskn scheduled and assigned onto two ARM processors: Processor 1 at Vdd = 1 V and Processor 2 at Vdd = 3.3 V]
Not single working point but Pareto
curves needed in global trade-off
Both data transfer-storage and concurrency aspects have to be combined!
[Pareto chart: energy (nJ, 0-3500) vs. time budget (µs, 0-250)]
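Why the two supply voltages above span a Pareto curve can be seen from first-order CMOS scaling: switching energy per cycle grows with Vdd^2, while gate delay shrinks roughly as Vdd / (Vdd - Vt)^2, so the 1 V processor is slow but cheap in energy and the 3.3 V one is fast but expensive. A sketch with assumed, normalised constants (Vt = 0.6 V and the unit capacitance are our choices, not figures from the slides):

```cpp
#include <cassert>

// A processor characterised only by its supply voltage.
struct Proc { double vdd; };

// First-order CMOS models, normalised (effective capacitance C = 1).
double energyPerCycle(const Proc& p) { return p.vdd * p.vdd; }

double delayPerCycle(const Proc& p, double vt = 0.6) {
    return p.vdd / ((p.vdd - vt) * (p.vdd - vt));
}

// Energy and time of running `cycles` cycles on processor p.
double taskEnergy(const Proc& p, double cycles) { return cycles * energyPerCycle(p); }
double taskTime(const Proc& p, double cycles)   { return cycles * delayPerCycle(p); }
```

Assigning a task to the low-Vdd processor trades execution time for energy, which is what produces distinct points on the energy vs. time-budget chart.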
Comparison for original and transformed
graphs on 2 processors with different Vdd
[Chart: energy (nJ, 0-3500) vs. time budget (µs, 0-250): Pareto curves for the original and the transformed graph]
Comparison for original and
transformed IM1 graphs on 2
processors with different Vdd
[Chart: Pareto curves for the original and the transformed IM1 graphs]
Comparison between the two transformed Pareto curves for 2 parallel target alternatives
[Chart: energy (nJ, 1-6) vs. time budget: Pareto curves for target architecture 2 and target architecture 3; each contributes part of the global Pareto curve, which is the "hull" of the two curves]
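Taking the "hull" of two Pareto curves is a small computation: merge the working points of both target architectures and drop every dominated point (one for which another point is at least as fast and at least as energy-cheap). A sketch (the struct and names are ours):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A (time, energy) working point of one target architecture.
struct Pt { double time, energy; };

// Merge points from any number of curves and keep the non-dominated
// ones: sort by time, then sweep, keeping only points whose energy is
// strictly below everything faster than them.
std::vector<Pt> paretoHull(std::vector<Pt> pts) {
    std::sort(pts.begin(), pts.end(), [](const Pt& a, const Pt& b) {
        return a.time < b.time || (a.time == b.time && a.energy < b.energy);
    });
    std::vector<Pt> hull;
    for (const Pt& p : pts)
        if (hull.empty() || p.energy < hull.back().energy)
            hull.push_back(p);
    return hull;
}
```

Feeding the points of both architectures into one call yields exactly the global curve: segments where one architecture dominates simply suppress the other's points.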
Comparison of genetic algorithm (GA)
and heuristic task scheduling results
[Chart: energy-cost (4400-5800) vs. time-budget (379-434) curves for the GA solutions and the heuristic]
Memory architecture model:
SDRAM bottleneck
[Diagram: client with on-chip cache hierarchy connected over a wide 128-1024 bit bus; data cache and bank combining, burst mode; SDRAM with local banks 1..N, each with a local latch and select, plus global bank select/control and address/control]
Memory architecture model:
cache performance problem
[Diagram: processor data paths with register file; L1 cache (e.g. 16 KB, N-port SRAM), L2 cache (e.g. 1 MB, 1/2-port SRAM), main memory (e.g. 256 Mb, 1-port (S)DRAM); many misses, page loading]
Memory architecture model:
energy consumption problem
[Diagram: processor chip, (S)DRAM main memory, hard disk]
Memory architecture model:
system bus load problem
System bus load = crucial bottleneck
[Diagram: system chip with processor data paths and on-chip memory/cache hierarchy (L1, L2, L2 bus); the main system bus connects to the SDRAM main memory, a disk-access bus to the hard disk, and other system resources share the bus]
Data management results for
ATM protocol module in
adaptation layer
Abstract data type refinement (ADT -> concrete data types): factor 5 fewer accesses (and energy)
Virtual memory management refinement (virtual memory -> segments & pools, dynamic memory management): factor 3 less energy
Physical memory management refinement (-> physical memories): factor 2 fewer memory ports for the same cycle budget (throughput)
Why is it important?
Dynamic data set in an ATM switch
[Diagram: ATM cells multiplexed at 155 Mb/sec (e.g. 25 x and 21 y cells on port 1; 47 x and 25 y after the MUX); a table of active connections is consulted per cell]
Table size without optimization: 16,384 Mbytes (a full table over the 32-bit {VPI, VCI, port} key)
Key1 (VPI[8]) | Key2 (VCI[16]) | Key3 (port[8]) -> combined key {VPI, VCI, port}[32]
2 = 0000 0010 | 5 = 0000 0000 0000 0101 | 1 = 0000 0001
4 = 0000 0100 | 7 = 0000 0000 0000 0111 | 2 = 0000 0010
2 = 0000 0010 | 1 = 0000 0000 0000 0001 | 1 = 0000 0001
2 = 0000 0010 | 5 = 0000 0000 0000 0101 | 2 = 0000 0010
Matisse: ADT refinement of SPP
component in ATM
The specification accesses the routing table only through an abstract data type:

    ATM_cell * Data_In;
    Association_Table * Routing_Table;
    Routing_Table = new Association_Table();
    Data_In = new ATM_cell();
    if ( Routing_Table->Lookup(Data_In) ) ...

Matisse refines the abstract Association_Table into a concrete structure, e.g. Array, Linked_List or Binary_Tree, by replacing only the declaration and the `new`.
[Diagrams: array (AR), linked list (LL) and binary tree (BT) layouts of key/data records; log-scale charts (10^0 to 10^4) of the power function and the area function across the implementation alternatives]
Dynamic data set specification
in ATM switch module
[Diagram of two layered structures:
Designer selection: PA(VPI) -> AR(VCI) (pointer array over the full 2^8 VPI range, arrays over the full 2^16 VCI range): 275,000 memory accesses per 1000 ATM cells; 327 mW; 601 mm^2
Matisse selection: PA(9) -> PA(5) -> AR(4) (pointer arrays and arrays over 2^9, 2^5 and 2^4 entries): 20,000 memory accesses per 1000 ATM cells; 110 mW; 135 mm^2]
Why is it important?
Design space for ATM switch table
Huge number of alternatives:
- 3 one-layer
- 1,116 two-layer
- 66,960 three-layer
- 3,883,680 four-layer
- roughly 2^(6n) n-layer
ATM switch module
exploration and optimization
[Chart: power (mW, log scale: 110 / 500 / 2,000 / 7,000 / 20,000) vs. number of layers (1-5) for network 1 and network 2]
Optimum network 1: PA(9) - PA(5) - AR(4) = 430 mW
Optimum network 2: PA(13) - AR(5) = 110 mW
Optimum of network 1 used in network 2: 188 mW
VMM = (de)allocation
[Diagram: the application issues new/free calls to the VMM, which manages the virtual memory]
Steps:
- select an optimized virtual memory (VM) organisation per ADT
- select the VM segment (VMS) sizes
- limit fragmentation
- implement (de-)allocation
- select the tracking mechanism
- define the VM managers
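One common way to realise the (de-)allocation and fragmentation-limiting steps is a fixed-size block pool per virtual memory segment, with a free list as the tracking mechanism. A minimal sketch of the idea (an illustration, not the Matisse VMM itself; class and size names are ours):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-size block pool carved out of one virtual memory segment.
// Because all blocks have the same size, freed blocks can always be
// reused, which bounds fragmentation inside the segment.
class SegmentPool {
    std::size_t blockSize_, next_, capacity_;
    std::vector<std::size_t> freeList_;        // offsets of freed blocks
public:
    SegmentPool(std::size_t blockSize, std::size_t numBlocks)
        : blockSize_(blockSize), next_(0), capacity_(numBlocks * blockSize) {}

    // Returns the block's offset inside the segment, or SIZE_MAX when full.
    std::size_t allocate() {
        if (!freeList_.empty()) {              // reuse a freed block first
            std::size_t off = freeList_.back();
            freeList_.pop_back();
            return off;
        }
        if (next_ >= capacity_) return static_cast<std::size_t>(-1);
        std::size_t off = next_;
        next_ += blockSize_;
        return off;
    }
    void deallocate(std::size_t off) { freeList_.push_back(off); }
};
```

A full VMM would hold one such pool per record type (e.g. per PA/AR layer), each sized from the exploration results.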
P - A - T Trade-offs
VM size for ATM switch module (network 1 ADT result)
[Diagram: grouping the PA(9), PA(5) and AR(4) data types of network 1 into virtual memory segments:
- 1 VMS: size = 133 mm^2, power = 110 mW
- 2 VMS: size = 137 mm^2, power = 68 mW
- 2 VMS (alternative grouping): size = 137 mm^2, power = 49 mW
- 3 VMS: size = 137 mm^2, power = 37 mW]
Data Transfer & Storage
Principles
1 Reduce redundant transfers
2 Introduce locality
3 Exploit memory hierarchy
4 Avoid N-port memories
5 Meet real-time constraints
6 Exploit limited life-time and data layout freedom
[Diagram: processor core with data paths, L1/L2 caches, local latches + banks (cache & bank recombination), off-chip SDRAM]
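Principle 6 can be made concrete: when each value is dead after its last read, the storage actually needed is the peak number of simultaneously live elements, not the array length, so arrays can be folded in place. A small sketch with illustrative lifetimes (the struct and the example windows are ours):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Lifetime of one array element: [birth, death] inclusive, in time steps
// (production to last consumption).
struct Life { int birth, death; };

// Peak number of simultaneously live elements = the number of storage
// locations an in-place mapping needs.
int peakLiveElements(const std::vector<Life>& lives) {
    int peak = 0, tmax = 0;
    for (const Life& l : lives) tmax = std::max(tmax, l.death);
    for (int t = 0; t <= tmax; ++t) {          // simple sweep over time
        int live = 0;
        for (const Life& l : lives)
            if (l.birth <= t && t <= l.death) ++live;
        peak = std::max(peak, live);
    }
    return peak;
}
```

For a buffer where each value is consumed two steps after production, only three locations are live at once regardless of the buffer's logical length, which is the storage reduction the principle points at.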
Systematic System Exploration
for Memory and Power
[Diagram: from the system specification, many alternatives ("?") are explored systematically until the chosen memory organization ("!") is reached]
Time - Efficient System Exploration Design
Flow by means of DTSE Tool Feedback
Initial system specification -> design alternatives
- fast implementation with tools
- accurate cost figures to guide decisions
- system-level feedback
Storage Bandwidth Optimization
High Level DTSE stage
[Diagram: a CDFG of read/write accesses R(A), W(B), R(C), W(D), ... is scheduled over time; accesses that end up in the same cycle conflict. The conflicts form a conflict graph (CG) over the arrays A, B, C, D; conflicting arrays must be assigned to different memories or to multi-port memories by high-level synthesis or the compiler, yielding the memory architecture]
How to achieve lower cost?
Balance bandwidth!
[Diagram: required memory bandwidth over time: a high peak vs. a balanced low profile; more cycles means lower cost]
Predefined storage example
Less switching between planes
•Lower performance
•Lower power consumption
Find Pareto points with DTSE
(Codec for high quality
images)
[Chart: power (mW, 150-300) vs. MCycles (0-30): curves for the original, the 2-layer cache and the 1-layer cache solutions, with the global Pareto curve. Source: Binary Tree Predictive Coder]
Digital audio broadcast
demonstrator
Assume configurable on-chip memory hierarchy
[Chart: power (mW, 5-25) vs. storage cycle budget (50,000-150,000)]
Trade off with other tasks:
system-wide
[Diagram: TASK1 on Processor 1, TASK2 on Processor 2, and TASK3]
Pareto curves allow
task trade-off decision
[Charts: cost vs. execution time Pareto curves for TASK-1 (0-6.0), TASK-2 (0-40,000) and TASK-3 (0-100,000). Source: Digital Audio Broadcast]
Atomium/Acropolis tool support
Cache management is next
memory bottleneck
Data transfer can be globally optimised (see IMEC's ATOMIUM research), BUT cache misses are the next bottleneck (energy, CPU performance, system bus load, ...).
[Diagram: processor data paths with L1/L2 caches on chip and off-chip SDRAM main memory]
Avoid cache misses: exploit limited life-time and data organisation freedom.
Initial Data Organization
[Memory map, initial: Current[50176] at address 0 and Previous[50176] at 50176 (image size 196 x 256), V4x[3136] at 100352, V4y[3136] at 103488, end at 106624/107520. Cache size = 512 bytes; line size = 16 bytes]
Improved Initial Data
Organization
[Memory map, improved initial: the same arrays with only their base addresses modified (Current at 0, Previous at 53312, V4x and V4y shifted accordingly). Cache size = 512 bytes; line size = 16 bytes; memory overhead = 0%]
Memory Data Organization
[Memory map, MDO-optimized: the arrays are split and interleaved at sub-array granularity matching the cache: tiles of Current[240], Previous[240], V4x[16] and V4y[16] (240+240+16+16 = 512 bytes, one cache span) repeated through memory at offsets 0, 240, 480, 496, 512, ... Cache size = 512 bytes; line size = 16 bytes; memory overhead = 1%; reduction in memory access time = factor 2.5]
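The effect of the base-address modification can be checked with the direct-mapped index function. Note that 50176 = 98 x 512, so in the initial layout Current[] and Previous[] map their corresponding elements onto the same cache lines and thrash each other; shifting one base address by a line already breaks the conflict. Cache and line sizes follow the slides; the helper function is our sketch:

```cpp
#include <cassert>
#include <cstddef>

// Direct-mapped cache geometry from the slides.
constexpr std::size_t kCacheSize = 512, kLineSize = 16;

// The cache line (set index) a byte address lands in: fixed by the
// address alone, which is why the data layout determines the conflicts.
std::size_t cacheLine(std::size_t addr) {
    return (addr % kCacheSize) / kLineSize;
}
```

Interleaving 512-byte tiles of all four arrays, as in the MDO layout, goes further: it keeps the working sets of one loop iteration in disjoint lines of the same cache span.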
Cavity Detection Algorithm on
Intel Pentium-MMX (main acc.)
[Bar chart: number of main memory accesses and execution time (sec, 0-16) after each optimisation step: conventional, conventional + ADOPT, data-flow transformations, loop transformations, data reuse (line buffer), data reuse (pixel buffer), in-place mapping, memory data layout, modulo reduction (ADOPT), other ADOPT]
Cavity Detection Algorithm on
Intel Pentium-MMX (+local acc.)
[Bar chart: main and local memory accesses and execution time (sec, 0-30) after the same optimisation steps]
Cavity Detection Algorithm on
Intel Pentium-MMX (+speed)
[Bar chart: main and local memory accesses and execution time (sec, 0-35) after the same optimisation steps]
Cavity Detection Algorithm on
4-way IBM SMP multi-processor
Inter-processor communication is reduced due to DTSE; the initial algorithm heavily degrades for more than 3 processors.
[Chart: execution time (sec, 0-4.5) vs. number of processors (1-4) for the initial algorithm and the DTSE-transformed version]
MPEG - 4 Video Motion
Estimation
[Bar chart: relative power (0-1.0) over the optimisation steps (task-level DTSE, processor-level DTSE) for the search-area memories, the VOP memories and the total memory power]
Resulting power reduction = 8
Consistent reduction in miss rate for
data-intensive applications
Application: Full-Search Motion Estimation
[Chart: miss rate (0-60%) vs. cache size (512, 1K, 2K) for the initial direct-mapped, the optimised direct-mapped and the fully associative cache]
Access reduction hints at the speed-up factor
Effect of several DTSE optimizations
[Bar chart: factor (0-20) for access reduction, speed-up (float) and speed-up (int) on three sequences: M & D CIF 120 kbps 30 fps, Foreman CIF 450 kbps 25 fps, Cal & Mob CIF 2 Mbps 30 fps]
Consistent Speed Up on
Different Platforms
[Bar chart: frame rate (0-120 frames/second) of the platform-independent MPEG-4 video decoder on a Pentium II 350 MHz, an HP PA-RISC 180 MHz and a TriMedia 100 MHz, for the same three sequences]
Power Reduced with Factor 21 to 48
[Bar chart: remaining memory power (%, 0-5.0) after optimisation, assessed on a proprietary architecture, for the three sequences]
Breakdown DTSE/Addr.Opt. in
frame rate for MPEG-4 decoder
[Stacked bar chart: frame rate (fps, 0-200) on a Pentium-III @ 500 MHz for the m&d, foreman and c&m sequences, broken down into the original, DTSE and address-optimisation (ADOPT) contributions]
Data organisation for efficient
direct mapped caches
Example: medical imaging (cavity detection)
[Chart: miss rate (%, 0-50) vs. cache size (256, 512, 1K, 2K bytes): initial direct map (cheap but inefficient), 4-way associative (costly but efficient), optimised direct map (cheap and efficient)]
ADOPT postprocessing allows
global power-speed trade-off
Example: medical imaging (cavity detection)
[Chart: power vs. execution time (both normalised, 0.1-1): conventional optimisation points vs. optimal trade-off curves, for Pentium-III, HP-RISC and TriMedia]
Trading off speed and power by controlling the I-cache for MPEG-4
[Chart: power vs. execution time (both normalised, relative to the reference) for Pentium-III and TriMedia on the m&d, foreman and c&m CIF sequences]
Voice coder (SW cache): what-if analysis for cache organisation
[Bar chart: relative power, split into cache power and main memory power, for cache sizes of 32, 64, 96, 128, 256 and 512 words]
Gain in power of an additional factor 6 compared to the optimized platform-independent code
Turbo coding principle
[Diagram: encoder: input U fed to constituent coders C1 and C2 (the second via an interleaver), producing coded output C; decoder: received Y fed to constituent decoders D1 and D2, which exchange information through the interleaver and its inverse, producing the estimate Û]
Results on turbo
decoder
Original: bit-rate 0.07 Mbit/s; power 1.07 mJ/bit; latency 5900 ms; area 3.5 mm^2
After DTSE optimizations: bit-rate 36 Mbit/s; power 1.85 W (0.05 mJ/bit); latency 10 ms; area 15 mm^2
Digital Audio Broadcast
receiver should be low power
~1.8Mb/s
Memory Dominance in
Power distribution for DAB
[Pie chart: memory 61.1%; data path & local control 29.3%; addressing 5.4%; global control 4.2%]
Example loop:
    for (i=0; i<n; i++) {
        c[i] = a[i*10] * b[i*5];
    }
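The loop above multiplies in the address path on every iteration; ADOPT-style address optimisation replaces a[i*10] and b[i*5] by induction variables that are merely incremented. A sketch with both versions so the transformation can be compared (the function names and array contents are ours):

```cpp
#include <cassert>
#include <vector>

// Straightforward version: one multiply per address, per iteration.
std::vector<int> mulVersion(const std::vector<int>& a,
                            const std::vector<int>& b, int n) {
    std::vector<int> c(n);
    for (int i = 0; i < n; ++i) c[i] = a[i * 10] * b[i * 5];
    return c;
}

// Strength-reduced version: running indices incremented by 10 and 5,
// so the address path contains no multiplication.
std::vector<int> inductionVersion(const std::vector<int>& a,
                                  const std::vector<int>& b, int n) {
    std::vector<int> c(n);
    for (int i = 0, ia = 0, ib = 0; i < n; ++i, ia += 10, ib += 5)
        c[i] = a[ia] * b[ib];
    return c;
}
```

Both versions compute the same result; only the addressing cost, which the pie chart shows is non-negligible, differs.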
Performance power tradeoff
on platform independent
optimized code
[Chart: energy vs. cycle budget (x1000, 0-120); E_one_task = 8.03 at f_max = 42.6 MHz; operating point P = 8.02 at 50399 cycles/symbol, bandwidth = 42.6 MHz. Note: the real case uses 7 memories]
Final signal to memory
assignment
An overall factor 7.2 power
gain
(without lowering voltage)
Addressing/control: 9.6% of the original power, reduced by a factor 2.7 (25.9% of the remaining power)
Data-path: 29.3%, reduced by a factor 5.9 (36.1% of the remaining power)
Memory: 61.1%, reduced by a factor 11.7 (38.0% of the remaining power)
Overall: 9.6/2.7 + 29.3/5.9 + 61.1/11.7 ≈ 13.9% of the original, i.e. a factor 7.2
Gain relative to an already low-power chip of Philips
Perspective
Platform design is a crucial way to manage growing design complexity.
Data-dominated applications require a matched approach based on DTSE support.
The platform-independent code transformation stage is always needed and useful.
The platform-dependent stage, with tool support at IMEC, achieves global system-wide trade-offs.
Come and see the demo after this session!
Acknowledgements (DTSE/Adopt)
Contributions to the Atomium-DTSE methodology
and tools from:
Lode Nachtergaele, Sven Wuytack, Eddy De Greef, Frank
Franssen, Michael van Swaaij, Florin Balasa, Jean-Philippe Diguet,
Ingrid Verbauwhede, Chen-Yi Lee, Michel Eyckmans, Arnout
Vandecappelle, Stefan Janssens, Jan Bormans, Peter Slock, Peeter
Ellervee
Contributions to ADOPT: Miguel Miranda, Martin Janssen, Cedric
Ghez, Arnout Vandecappelle, Sumit Gupta, Heiko Falk, Kris Croes
Additional contributions to the Acropolis-DTSE methodology and
tools from:
Koen Danckaert, Chidamber Kulkarni, Erik Brockmeyer, Kostas
Masselos, Thierry Omnes, Tanja Van Achteren, Thierry Franzetti,
Per Gunnar Kjeldsberg, Antoine Fraboulet, Fabien Coelho
Interesting input and discussions from industrial partners, e.g. Paul
Lippens and Jef Van Meerbergen (Philips), Gjalt de Jong (Alcatel)
Acknowledgements (DMM/TCM)
Contributions to the DMM methodology and tools from:
Chantal Ykman, Sven Wuytack, Julio da Silva, Jurgen Lambrecht,
Pol Marchal
Contributions to the TCM methodology and tools from:
Pol Marchal, Aggeliki Prayati, Chun Wong, Peng Yang, Nathalie
Cossement, Rudy Lauwereins, Johan Cockx, Dirk Desmet