The Artemis Architecture Workbench


Sesame
Opening new doors to Multi-level Design
Space Exploration of
Embedded Systems Architectures
Andy D. Pimentel
Computer Systems Architecture group
University of Amsterdam
Informatics Institute
Outline



Background and problem statement
General overview of modeling methodology
Sesame environment
 Application modeling layer
 Architecture modeling layer
 Mapping layer

Gradual refinement of architecture models
 Event refinement using dataflow graphs
 Both computational and communication refinement

Current status and future work
Embedded media systems

Modern embedded systems for media and signal
processing must
 support multiple applications and various standards
 often provide real-time performance

These systems increasingly have heterogeneous
system architectures, integrating
 Dedicated hardware
• High performance and low power/cost
 Embedded processor cores
• High flexibility
 Reconfigurable components (e.g. FPGAs)
• Good performance/power/flexibility
Rethinking system design


Design complexity forces us to reconsider current
design practice
Classical design methods
 often depart from a single application specification
which is gradually synthesized into HW/SW
implementation
 lack generalizability to cope with highly programmable
architectures targeting multiple applications
 also hamper extensibility to efficiently support future
applications
Rethinking system design (cont’d)

Traditionally, designers only rely on detailed
simulators for design space exploration
 HW/SW co-simulation

This approach becomes infeasible for the early
design stages
 Effort to build these simulators is too high as
systems become too complex
 The low speeds of these simulators seriously
hamper the architectural exploration
 HW/SW co-simulation requires a HW/SW
partitioning
• A new system model is needed for assessment of each
HW/SW partitioning
“Jumping down” the design pyramid
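[Figure: design pyramid. Abstraction runs from high (top) to low (bottom), effort from low (top) to high (bottom). Levels from top to bottom: specification; back-of-the-envelope calculations; abstract executable models; cycle-true simulation models (10000 lines, mins/hours); synthesizable RTL models (10000+ lines, hours/days). The base of the pyramid spans the alternative realizations.]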
Design by stepwise refinement
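[Figure: design pyramid, explored level by level. Abstraction runs from high (top) to low (bottom), effort from low (top) to high (bottom). Levels from top to bottom: specification; back-of-the-envelope calculations; abstract executable models (1000 lines, secs/minutes); cycle-true simulation models (10000 lines, mins/hours); synthesizable RTL models (10000+ lines, hours/days). The base of the pyramid spans the alternative realizations.]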
Sesame
Simulation of Embedded Systems Architectures for Multi-level Exploration

Provides methods and tools to efficiently evaluate
the performance of heterogeneous embedded
systems and explore their design space






Different architectures, applications, and mappings
Different HW/SW partitionings
Smooth transition between abstraction levels
Mixed-level simulations
Promotes reuse of models (re-use of IP)
Targets the multimedia application domain
 Techniques and tools also applicable to other
application domains
Y-chart Design Methodology [Kienhuis]
[Figure: Y-chart. Applications and Architecture are combined through Mapping and evaluated by Performance Analysis, which yields Performance Numbers.]
Use separate models for application and architecture behavior
Modeling and simulation using the
Y-Chart methodology

Application model
 Description of functional behavior of an
application
 Independent from architecture, HW/SW
partitioning and timing characteristics
 Generates application events representing
the workload imposed on the architecture

Architecture model
 Parameterized timing behavior of
architecture components
 Models timing consequences of
application events

[Figure: the application model passes traces of application events to the architecture model.]
Explicit mapping of application and architecture models
 Trace-driven co-simulation [Lieverse]
 Easy reuse of both application and architecture models!
Application modeling

Using Kahn Process Networks (KPNs)
 Parallel (C/C++) processes communicating with each
other via unbounded FIFO channels
• expresses parallelism in an application and makes
communication explicit
• blocking reads, non-blocking writes

Generation of application events:
 Code is instrumented with annotations describing
computational actions
 Reading from/writing to Kahn channels represent
communication behavior
 Application events can be very coarse-grained, like
“compute a DCT” or “read/write a pixel block”
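A minimal sketch of this idea, written in Python rather than the C/C++ used for the real application models. The Channel and Process classes, the channel names and the event tuples are assumptions made for illustration and are not Sesame's API. Each Kahn process runs as a thread, reads block, writes never block, and every read, write and annotated computation appends an application event to a per-process trace.

```python
import queue
import threading


class Channel:
    """Unbounded FIFO: writes never block, reads block until a token arrives."""

    def __init__(self, name):
        self.name = name
        self.fifo = queue.Queue()  # maxsize=0 means unbounded


class Process:
    """A Kahn process that records the application events it generates."""

    def __init__(self, name):
        self.name = name
        self.trace = []

    def read(self, channel):
        self.trace.append(("R", channel.name))
        return channel.fifo.get()          # blocking read

    def write(self, channel, token):
        self.trace.append(("W", channel.name))
        channel.fifo.put(token)            # non-blocking write

    def execute(self, operation):
        self.trace.append(("E", operation))  # annotation of a computation


def run_network(n_blocks=2):
    a_to_b, b_to_c = Channel("a_to_b"), Channel("b_to_c")
    proc_a, proc_b, proc_c = Process("A"), Process("B"), Process("C")
    received = []

    def body_a():
        for i in range(n_blocks):
            proc_a.execute("produce_block")
            proc_a.write(a_to_b, [i] * 4)

    def body_b():
        for _ in range(n_blocks):
            block = proc_b.read(a_to_b)
            proc_b.execute("dct")
            proc_b.write(b_to_c, [2 * x for x in block])  # stand-in for a DCT

    def body_c():
        for _ in range(n_blocks):
            received.append(proc_c.read(b_to_c))
            proc_c.execute("consume_block")

    # Start the consumer first to show that start order does not matter.
    threads = [threading.Thread(target=f) for f in (body_c, body_b, body_a)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {p.name: p.trace for p in (proc_a, proc_b, proc_c)}, received


traces, received = run_network()
print(traces["B"])
```

Because the network is a KPN, each per-process trace is the same on every run, whatever the thread interleaving.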
Application modeling (cont’d)

Why Kahn process networks (KPNs)?
 Fit very well to multimedia application domain
 KPNs are deterministic
• automatically guarantees validity of event traces when
application and architecture simulators are executed
independently

Application model can also be analyzed in
isolation from any architecture model
 Investigation of upper performance bounds and
early recognition of bottlenecks within application
Architecture modeling

Architecture models react to application trace
events to simulate the timing behavior
 Accounting for functional behavior is not necessary!

Architecture modeling at varying abstraction levels
 Starting at ‘black box’ level
 Processing cores can model timing behavior of SW,
HW or reconfigurable execution
• parameterizable latencies for the application events
• SW execution = high latency, HW execution = low latency
 Allows for rapid evaluation of different HW/SW
partitionings!
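A minimal sketch of a black-box processing core, assuming a hypothetical latency table: all event names and cycle counts below are invented for illustration. The model only adds up latencies for the events in a trace, so changing the HW/SW partitioning amounts to choosing a different latency per event.

```python
# Hypothetical latencies (cycles per application event) for each execution style.
LATENCY = {
    "sw": {"rgb2yuv": 300, "dct": 400, "vle": 250},
    "hw": {"rgb2yuv": 30, "dct": 40, "vle": 25},
}


def total_cycles(trace, partitioning):
    """Sum the latency of every event in the trace.

    partitioning maps an event name to "sw" or "hw"; no functional
    behavior is simulated, only the timing consequence of each event.
    """
    return sum(LATENCY[partitioning[event]][event] for event in trace)


trace = ["rgb2yuv", "dct", "vle"] * 2   # two blocks' worth of events

all_sw = {"rgb2yuv": "sw", "dct": "sw", "vle": "sw"}
dct_in_hw = {"rgb2yuv": "sw", "dct": "hw", "vle": "sw"}

print(total_cycles(trace, all_sw))      # 1900
print(total_cycles(trace, dct_in_hw))   # 1180
```

The application trace is untouched between the two runs; only the parameters of the architecture model change.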
Architecture modeling (cont’d)

Models implemented in Pearl
• Object-based discrete event simulation language
 Keeps track of virtual time
 Provides simulation primitives
 Inter-object communication via message-passing
 Keeps track of simulation statistics
• “RISC-like” language: keep it simple and make the
common case fast
 Lacks features not needed for architectural modeling
(e.g., no dynamic data structures, dynamic object
creation, etc.)
• Result: high-performance modeling & simulation
 High simulation speed and low modeling effort
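The ingredients listed above (virtual time, message passing between objects, simulation statistics) can be illustrated with a small discrete-event kernel. This is a Python sketch, not Pearl code; the Simulator, Memory and Processor classes and all timing values are assumptions made for illustration.

```python
import heapq
import itertools


class Simulator:
    """Tiny discrete-event kernel: a time-ordered queue of messages."""

    def __init__(self):
        self.now = 0                      # virtual time
        self._queue = []
        self._seq = itertools.count()     # tie-breaker for equal timestamps

    def send(self, delay, target, message):
        entry = (self.now + delay, next(self._seq), target, message)
        heapq.heappush(self._queue, entry)

    def run(self):
        while self._queue:
            self.now, _, target, message = heapq.heappop(self._queue)
            target.receive(self, message)


class Memory:
    def __init__(self, access_time):
        self.access_time = access_time
        self.accesses = 0                 # simulation statistic

    def receive(self, sim, message):
        sender, _request = message
        self.accesses += 1
        sim.send(self.access_time, sender, "done")


class Processor:
    def __init__(self, memory, n_loads):
        self.memory = memory
        self.remaining = n_loads
        self.finished_at = None

    def start(self, sim):
        sim.send(0, self.memory, (self, "load"))

    def receive(self, sim, message):
        self.remaining -= 1
        if self.remaining > 0:
            sim.send(1, self.memory, (self, "load"))  # 1 cycle to issue the next load
        else:
            self.finished_at = sim.now


sim = Simulator()
mem = Memory(access_time=10)
cpu = Processor(mem, n_loads=3)
cpu.start(sim)
sim.run()
print(cpu.finished_at, mem.accesses)      # 32 3
```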
Architecture modeling (cont’d)

Models implemented in SystemC
 We added a layer on top of SystemC 2.0, called
SCPEx (SystemC Pearl Extension)
• Provides SystemC with Pearl’s message-passing semantics
• Raises abstraction level of SystemC (e.g., no ports,
transparent incorporation of synchronization)
• Improves transaction-level modeling
 SCPEx enables reuse of Pearl models in SystemC
context
• Makes Pearl → SystemC translation trivial
• Provides link towards possible implementation
• Facilitates importing SystemC IP models in Sesame
Sesame in layers
[Figure: three layers. Application model: three Kahn processes. Mapping layer: one virtual processor per Kahn process, fed by that process's event trace, plus buffers. Architecture model: Processor 1 and Processor 2 connected to Mem via a bus.]
Sesame’s mapping layer


Maps application tasks (event traces)
to architecture model components
Guarantees deadlock-free scheduling
of application events
Scheduling of communication events
Because Read events are blocking (Kahn),
some schedules may yield deadlock
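[Figure: an application model with nodes A, B and C whose communication events (Read(C), Write(C), Write(A), Read(B)) are scheduled onto two processor cores connected by a bus in the architecture model.]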
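A small sketch of why the order in which events are dispatched matters when several tasks share one core. The task names, channel names and the data-driven policy below are assumptions for illustration; the slides do not specify how the mapping layer implements its scheduling. A Read is dispatched only when its channel holds a token, so the core never stalls on an event that another task on the same core would have to unblock.

```python
from collections import deque


def schedule(traces):
    """Data-driven scheduling of several event traces on one core.

    traces maps a task name to a list of ("R" | "W", channel) events.
    Returns the dispatch order, or None if no task can proceed.
    """
    tokens = {}                                   # channel -> tokens available
    pending = {task: deque(events) for task, events in traces.items()}
    order = []
    while any(pending.values()):
        for task, events in pending.items():
            if not events:
                continue
            kind, channel = events[0]
            if kind == "R" and tokens.get(channel, 0) == 0:
                continue                          # would block: try another task
            tokens[channel] = tokens.get(channel, 0) + (1 if kind == "W" else -1)
            order.append((task, kind, channel))
            events.popleft()
            break
        else:
            return None                           # every remaining task is blocked
    return order


def stalls_in_order(order):
    """True if dispatching events strictly in this order hits a Read on an empty channel."""
    tokens = {}
    for _task, kind, channel in order:
        if kind == "R":
            if tokens.get(channel, 0) == 0:
                return True
            tokens[channel] -= 1
        else:
            tokens[channel] = tokens.get(channel, 0) + 1
    return False


traces = {
    "P": [("R", "c"), ("W", "d")],
    "Q": [("W", "c"), ("R", "d")],
}

naive = [("P", "R", "c"), ("Q", "W", "c"), ("P", "W", "d"), ("Q", "R", "d")]
print(stalls_in_order(naive))      # True: P's Read is dispatched before Q's Write
print(schedule(traces))
```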
Sesame’s mapping layer



Maps application tasks (event traces)
to architecture model components
Guarantees deadlock-free scheduling
of application events
Accounts for synchronization behavior
• Mapping layer executes in same time domain as
architecture model

Transforms application-level events into
primitives (events) for architecture model
• More on this later on...

Tool for auto-generation of mapping layer
Y-chart Modeling Language (YML)

Flexible and persistent description (XML) of
 The structure of application and architecture
models (connecting library components)
• SCPEx also supports YML!
 The mapping of appl. models onto arch. models
(i.e., the mapping layer)

YML combines scripting language within XML
 Simplifies descriptions of complicated structures
 Increases expressive power of components
• E.g., a parameterized complex interconnect component
modeling a network of arbitrary size
 Increases reusability
• Re-use of components and structures
An illustrative case study: M-JPEG

Lossy, Motion-JPEG encoder
 Accepts both RGB and YUV formats
 Includes dynamic quality control by on-the-fly
adaptation of quantization and Huffman tables
[Figure: video stream (RGB or YUV) → RGB to YUV conversion → video stream (YUV) → JPEG encoding → M-JPEG encoded video stream, with the observed bitrate shown as an additional signal.]
The platform architecture

Bus-based shared memory multiprocessor
architecture
[Figure: VIP, microProcessor (mP), DSP1, DSP2 and VOP sharing a Memory over a bus.]
M-JPEG case study (cont’d)
[Figure: the M-JPEG application (RGB to YUV conversion followed by JPEG encoding) is mapped onto the platform architecture (VIP, mP, DSP1, DSP2, VOP, Memory); exploration varies the application, the mapping and the architecture.]
M-JPEG case study (cont’d)
[Figure: layered view of the case study.
Application model (Kahn Process Network, functional behavior): Video in, DMUX, RGB2YUV, DCT, Quantizer, VLE, Control and Video out, exchanging data blocks, YUV blocks (4:1), Q blocks (4:1), bitstream packets and control tokens; a sequence of video frames goes in, compressed video frames come out.
Event traces connect the application model to the architecture model.
Architecture model (library approach, timing behavior): VIP, mP, DSP1, DSP2 and VOP with numbered input/output ports, connected through BUS(B1) and FIFOs to a Memory holding image buffers 1..N (lines 1 to 8 each) and header, tables, DCT → Q, Q → VLE, packet and statistics buffers.]
M-JPEG design space exploration

Experimented with different






HW/SW partitionings
Application-architecture mappings
Processor speeds
Interconnect structures (bus, crossbar and Ω networks)
This took about 1 person-month (all modeling
included)
Simulation performance: for 128x128 frames, a
270 MHz Sun Ultra 5 Sparcstation simulated
2.3 frames/second (= 0.43 secs/frame)
Mapping problem: implementation gap
[Figure: implementation gap. The application behavioral model (what?) and the architecture model (how?) each have their own primitive operations; exploration/refinement has to bridge the gap between them on the way to an implementation.]
Mapping problem
Application events: Read, Write and Execute
 Typical mismatch between application events and
architecture primitives, examples:

• Architecture primitives operating on different data
granularities
• Architecture primitives more refined than application
events
Trace events from the application layer need to
be refined
 How?

• Refine the application model
• A transformation mechanism between the application
and architecture models
Communication refinement


Let’s take the mismatch of communication
primitives as an example
Assume following architecture communication
primitives






Synchronization primitives
• Check-Data (CD)
• Signal-Room (SR)
• Check-Room (CR)
• Signal-Data (SD)
Data movement primitives
• Load-Data (Ld)
• Store-Data (St)
Communication refinement (cont’d)

Transformation rules for refining application-level
communication events [Lieverse]
 R → CD → Ld → SR   (1)
 W → CR → St → SD   (2)
 E → E   (3)

How to transform traces of application events
using (1), (2) and (3)?
Process A:
while (1) {
  compute();
  write(block);
}
Process B (generates REW event sequences):
while (1) {
  read(block);
  compute();
  write(block);
}
Process C:
while (1) {
  read(block);
  compute();
}
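The most direct answer to the question above is to apply rules (1), (2) and (3) to each event independently. A minimal sketch, assuming traces are represented as Python lists of the event names R, W and E (a representation chosen here for illustration):

```python
# Rules (1)-(3): each application event expands into architecture primitives.
RULES = {
    "R": ["CD", "Ld", "SR"],   # rule (1)
    "W": ["CR", "St", "SD"],   # rule (2)
    "E": ["E"],                # rule (3)
}


def refine(trace):
    """Expand every application event into its architecture-level primitives."""
    return [primitive for event in trace for primitive in RULES[event]]


print(refine(["R", "E", "W"]))
# ['CD', 'Ld', 'SR', 'E', 'CR', 'St', 'SD']
```

This event-by-event expansion keeps the primitives of one event together, so it cannot reorder primitives across events.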
Communication refinement (cont’d)
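[Figure: processes A, B and C mapped onto Processors 1, 2 and 3, which share Mem over a bus.]
• Assumption 1: processor 2 has local (block) memory
• Assumption 2: processor 2 has NO local (block) memory
• Transforming REW event sequences from process B:
 Under assumption 1: R → E → W  ⇒  CD → Ld → SR → E → CR → St → SD
 Under assumption 2: R → E → W  ⇒  CD → CR → Ld → E → St → SR → SD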
IDF-based trace transformation
Virtual processors in mapping layer are refined
to accomplish trace refinement
 Integer-controlled DataFlow (IDF) model
describes internal behavior of a virtual processor
 Application events specify

 what a virtual processor executes
 with whom it communicates

Internal IDF model specifies
 how the computations and communications take
place at the architecture layer
IDF-based trace transformation (cont’d)
[Figure: the three layers and their models of computation. Application model (process network): processes A, B and C. Mapping layer (dataflow): virtual processors X, Y and Z. Architecture model (discrete event): Processors 1, 2 and 3 connected to Mem via a bus.]
Communication refinement revisited
[Figure: processes A, B and C are mapped via virtual processors X, Y and Z onto Processors 1, 2 and 3, which share Mem over a bus.]
• Assumption: processor 2 has NO local (block) memory
• Transforming REW event sequences from process B:
R → E → W  ⇒  CD → CR → Ld → E → St → SR → SD
Communication refinement revisited (2)
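[Figure: IDF graph that refines the event trace of process B (R, E, W) inside its virtual processor, shown between virtual processors X and Z. A switch actor driven by the event trace routes tokens to CD, CR, Ld, E, St, SR and SD actors; an actor X, with X = {Ld, St, E}, decomposes into X-init and X-exit. The graph exchanges tokens with processor 2 and the bus in the architecture model.]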
Computational refinement
[Figure: computational refinement. Processes A, B and C are mapped via virtual processors onto Processors 1, 2 and 3 (bus, Mem); the virtual processor of process B passes the sequence R, E, E, E, W to Processor 2.]
Putting Sesame to use:
An example design flow
[Figure: design flow combining the Sesame framework (system-level architecture exploration), the Compaan/Laura framework (Leiden University) and the Molen reconfigurable architecture framework (Delft University). Labels: applications, Motion-JPEG encoder, DCT, architecture + simulation environment, experimentation, detailed performance estimates, code suitable for FPGA execution, proposed architecture, a real implementation using Compaan/Laura/Molen.]
The MOLEN Prototype
[Figure: block diagram with Register File, Main Memory, Instruction Fetch, Data Load/Store, ARBITER, DATA MEMORY MUX/DEMUX, Core Processor and Exchange Registers.]
Mapping M-JPEG on the Molen
platform architecture
[Figure: the MJPEG code is split between the C++ compiler and the DCT* kernel, which goes through Compaan and Laura. The DCT* kernel is a pipeline in_block → IN → Preshift → 2D-DCT → OUT → out_block, with pixels passed between stages, and is placed in the Custom Computing Unit (CCU) of the Reconfigurable Processor, next to the reconfigurable microcode unit.]
for k = 1:1:4,
  for j = 1:1:64,
    [Pixel (k,j)] = In(inBlock);
  end
end
for k = 1:1:4,
  if k <= 2,
    for j = 1:1:64,
      [Pixel (k,j)] = PreShift(Pixel (k,j));
    end
  end
  [Block] = 2D_dct( Pixel );
end
for k = 1:1:4,
  for j = 1:1:64,
    [outBlock] = Out(Pixel(k,j));
  end
end
System-level simulation experiment

Modeling Molen with DCT mapped onto CCU
 Validation against real implementation


Information from Compaan/Laura/Molen used
for calibration of architecture model
Apply architecture model refinement
 Keep M-JPEG application model untouched
 DCT component in architecture model is refined
• Operates at pixel level
• Abstract pipeline model, deeply pipelined
 Other architecture components operate at (pixel)block level
Sesame’s IDF-based model refinement
[Figure: the M-JPEG application model, the mapping layer and the Molen architecture model; the DCT is mapped on the CCU and refined in the mapping layer.]
[Figure: IDF graph of the DCT virtual processor. Between repeat-begin and repeat-end, the event trace drives cd/cr, ld, in, preshift (inside a case-begin/case-end construct steered by a control trace), 2d-dct, out, st and sr/sd actors, together with a scheduler and latency annotations; token rates are 1, 4 and 64. Pixels (P1, P2) and blocks (Block in, Block out) are exchanged to/from the architecture model.]
Simulation results

Full software implementation
 Simulation: 85024000 cycles
 Real Molen: 84581250 cycles
 Error: 0.5%

DCT mapped onto CCU
 Simulation: 40107869 cycles
 Real Molen: 39369970 cycles
 Error: 1.9%

No tuning was done!
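The error figures follow directly from the cycle counts, taking the real Molen measurement as the reference. A short check (the function and dictionary names are chosen here for illustration):

```python
def relative_error(simulated, measured):
    """Relative deviation of the simulated cycle count from the measured one."""
    return abs(simulated - measured) / measured


results = {
    "Full software implementation": (85_024_000, 84_581_250),
    "DCT mapped onto CCU": (40_107_869, 39_369_970),
}

for name, (simulated, measured) in results.items():
    print(f"{name}: {100 * relative_error(simulated, measured):.1f}% error")
# Full software implementation: 0.5% error
# DCT mapped onto CCU: 1.9% error
```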
Where are we going?
Some ongoing and future work
NoC modeling


So far, we mainly modeled bus-based systems
Networks-on-Chip (NoC) will be our (near) future





Standardized interfaces
Scalable (point-to-point) networks
Much more complex protocols (protocol stack?)
QoS aspects
Modeling NoCs
 Topologies, switching & routing methods, flow control, protocols, QoS, etc.
 Communication mapping
 Modeling at multiple abstraction levels
• Gradual refinement
• Role of IDF models
Architecture model calibration
Initial derivation of latency parameters:
• documentation
• educated guess
• performance budgeting (what is the required
parameter range?)

Op   Latency
x    50
y    100
z    10

Next step: calibration with lower-level, external
simulation models or prototypes, e.g.
• Instruction set simulators (ISSs)
• Compaan/Laura framework

[Figure: Sesame's three layers (Kahn processes, virtual processors with buffers, Processors 1 and 2 with bus and Mem), annotated with the latency table.]
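One simple way to picture the calibration step: start from the initial latency table and overwrite an entry whenever lower-level measurements are available for that operation. This is only a sketch; the measured cycle counts below are invented, and averaging the samples is an assumption rather than the method used in Sesame.

```python
from statistics import mean

# Initial latency table from the slide (cycles per operation).
initial_latency = {"x": 50, "y": 100, "z": 10}


def calibrate(latency, measurements):
    """Return a new table where measured operations use their mean latency.

    measurements maps an operation to cycle counts observed in a
    lower-level model (e.g. an ISS run); unmeasured operations keep
    their initial value.
    """
    calibrated = dict(latency)
    for operation, samples in measurements.items():
        calibrated[operation] = round(mean(samples))
    return calibrated


# Hypothetical cycle counts obtained from a lower-level simulation.
measured = {"x": [46, 48, 47], "y": [120, 118]}

print(calibrate(initial_latency, measured))
# {'x': 47, 'y': 119, 'z': 10}
```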
Mixed-level system simulation

“Zoom in” on interesting system components in
architecture model
 Simulate these components at a lower level

Retain high abstraction level for other components
 Saves modeling effort
 May save simulation overhead

Integration of external simulation models
 ISSs, SystemC models, etc.



Also allows calibration of higher-level models
BUT…
Mixed-level simulation can be complex!
 multiple time domains and time grain sizes (synchronization)
 differences in protocol and data granularity of components
Mixed-level system simulation (cont’d)
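[Figure: application processes A, B, C and D; IDF-based refinement in the mapping layer (A’, B’, C’, C’’ and a scheduler); architecture model with P1, P2, P3 and Mem, embedding external models such as an ISS and a SystemC model.]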
Towards real design space exploration

Sesame supplies basic methods & tools for
evaluating application, architecture, and mapping
combinations
 Simulating entire design space is not an option

More is needed to explore large design spaces
 What will be the initial design(s) to evaluate?
 How to react when the evaluated architecture does
not suffice?

We need steering before and during simulation
 Design decisions using analytical modeling
• Finding Pareto-optimal candidates using multi-objective
optimization
 Design evaluation using simulation
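Finding Pareto-optimal candidates can be sketched as a dominance filter over the objective vectors of the candidate designs. The candidate names and their (cycles, power, cost) values below are invented for illustration; a real exploration would obtain them from analytical models and would typically search the space with a heuristic rather than enumerate it.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one.

    All objectives are minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return {
        name: objectives
        for name, objectives in candidates.items()
        if not any(dominates(other, objectives) for other in candidates.values())
    }


# Hypothetical candidates: (cycles, power, cost), all to be minimized.
candidates = {
    "all_sw": (1900, 5, 1),
    "dct_hw": (1180, 4, 3),
    "all_hw": (400, 6, 9),
    "slow_and_costly": (2000, 6, 4),
}

print(sorted(pareto_front(candidates)))
# ['all_hw', 'all_sw', 'dct_hw']
```

Only the surviving candidates would then be passed on to the more expensive simulation step.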
Real design space exploration (cont’d)
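[Figure: Y-chart extended for design space exploration. Analytical application models and an analytical architecture model feed a multi-objective optimization (heuristic methods like evolutionary algorithms) that takes into account performance, power and cost and yields candidate system architectures. The candidates are evaluated in the simulation Y-chart: application models, mapping, architecture model, performance analysis, performance numbers.]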
Credits
This work would not have been possible without
the (ground-laying work of the) following people:







Paul Lieverse
Bart Kienhuis
Ed Deprettere
Pieter van der Wolf
Kees Vissers
Vladimir Zivkovic
Todor Stefanov






Cagkan Erbas
Simon Polstra
Berry van Halderen
Joseph Coffland
Frank Terpstra
Mark Thompson
For more information
URL:
www.science.uva.nl/~andy/publications.html
or
email: [email protected]
Sesame software can be found at:
sesamesim.sourceforge.net