The Artemis Architecture Workbench
Sesame
Opening new doors to Multi-level Design
Space Exploration of
Embedded Systems Architectures
Andy D. Pimentel
Computer Systems Architecture group
University of Amsterdam
Informatics Institute
Thank you.
Questions?
Outline
Background and problem statement
General overview of modeling methodology
Sesame environment
Application modeling layer
Architecture modeling layer
Mapping layer
Gradual refinement of architecture models
Event refinement using dataflow graphs
Both computational and communication refinement
Current status and future work
Embedded media systems
Modern embedded systems for media and signal
processing must
support multiple applications and various standards
often provide real-time performance
These systems increasingly have heterogeneous
system architectures, integrating
Dedicated hardware
• High performance and low power/cost
Embedded processor cores
• High flexibility
Reconfigurable components (e.g. FPGAs)
• Good performance/power/flexibility
Rethinking system design
Design complexity forces us to reconsider current
design practice
Classical design methods
often start from a single application specification
that is gradually synthesized into a HW/SW
implementation
lack generalizability to cope with highly programmable
architectures targeting multiple applications
also hamper extensibility to efficiently support future
applications
Rethinking system design (cont’d)
Traditionally, designers rely only on detailed
simulators for design space exploration
HW/SW co-simulation
This approach becomes infeasible for the early
design stages
Effort to build these simulators is too high as
systems become too complex
The low speeds of these simulators seriously
hamper the architectural exploration
HW/SW co-simulation requires a HW/SW
partitioning
• A new system model is needed for assessment of each
HW/SW partitioning
“Jumping down” the design pyramid
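[Figure: design pyramid. Abstraction runs from high (specification, at the top) to low; effort runs from low to high; the base spans the alternative realizations. Levels from top to bottom: back-of-the-envelope calculations; abstract executable models; cycle-true simulation models (10000 lines, mins/hours); synthesizable RTL models (10000+ lines, hours/days).]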
Design by stepwise refinement
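[Figure: the same design pyramid, now explored level by level. Abstraction runs from high (specification) to low; effort runs from low to high; the base spans the alternative realizations. Levels from top to bottom: back-of-the-envelope calculations; abstract executable models (1000 lines, secs/minutes); cycle-true simulation models (10000 lines, mins/hours); synthesizable RTL models (10000+ lines, hours/days).]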
Sesame
Simulation of Embedded Systems Architectures for Multi-level Exploration
Provides methods and tools to efficiently evaluate
the performance of heterogeneous embedded
systems and explore their design space
Different architectures, applications, and mappings
Different HW/SW partitionings
Smooth transition between abstraction levels
Mixed-level simulations
Promotes reuse of models (re-use of IP)
Targets the multimedia application domain
Techniques and tools also applicable to other
application domains
Y-chart Design Methodology [Kienhuis]
[Figure: Y-chart. Applications and Architecture are combined through Mapping; Performance Analysis of the mapped system yields Performance Numbers.]
Use separate models for application and architecture behavior
Modeling and simulation using the
Y-Chart methodology
Application model
Description of functional behavior of an
application
Independent from architecture, HW/SW
partitioning and timing characteristics
Generates application events representing
the workload imposed on the architecture
Architecture model
Parameterized timing behavior of
architecture components
Models timing consequences of
application events
[Figure: the application model passes traces of application events to the architecture model.]
Explicit mapping of application and architecture models
Trace-driven co-simulation [Lieverse]
Easy reuse of both application and architecture models!
Application modeling
Using Kahn Process Networks (KPNs)
Parallel (C/C++) processes communicating with each
other via unbounded FIFO channels
• expresses parallelism in an application and makes
communication explicit
• blocking reads, non-blocking writes
Generation of application events:
Code is instrumented with annotations describing
computational actions
Reading from/writing to Kahn channels represent
communication behavior
Application events can be very coarse-grained, like
“compute a DCT” or “read/write a pixel block”
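A minimal Python sketch of this idea, not Sesame's actual C/C++ implementation: three threads stand in for Kahn processes, `queue.Queue` objects stand in for the unbounded FIFO channels, and the `traces` lists stand in for the stream of application events. The process bodies, the function name `run_network` and the one-letter event names R, E and W are assumptions chosen for illustration.

```python
import queue
import threading


def run_network(n_blocks=3):
    """Run a pipeline A -> B -> C and return the event trace of each process."""
    ab, bc = queue.Queue(), queue.Queue()  # unbounded FIFO channels
    traces = {"A": [], "B": [], "C": []}

    def proc_a():
        for i in range(n_blocks):
            traces["A"].append("E")  # annotation for a computational action
            ab.put(i)                # write never blocks on an unbounded FIFO
            traces["A"].append("W")

    def proc_b():
        for _ in range(n_blocks):
            block = ab.get()         # blocking read
            traces["B"].append("R")
            traces["B"].append("E")
            bc.put(block * 2)
            traces["B"].append("W")

    def proc_c():
        for _ in range(n_blocks):
            bc.get()                 # blocking read
            traces["C"].append("R")
            traces["C"].append("E")

    # Start the consumers first to show that start order does not matter.
    threads = [threading.Thread(target=f) for f in (proc_c, proc_b, proc_a)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return traces
```

Each process appends only to its own trace, so the per-process traces do not depend on how the threads are interleaved.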
Application modeling (cont’d)
Why Kahn process networks (KPNs)?
Fit very well to multimedia application domain
KPNs are deterministic
• automatically guarantees validity of event traces when
application and architecture simulators are executed
independently
Application model can also be analyzed in
isolation from any architecture model
Investigation of upper performance bounds and
early recognition of bottlenecks within application
Architecture modeling
Architecture models react to application trace
events to simulate the timing behavior
Accounting for functional behavior is not necessary!
Architecture modeling at varying abstraction levels
Starting at ‘black box’ level
Processing cores can model timing behavior of SW,
HW or reconfigurable execution
• parameterizable latencies for the application events
• SW execution = high latency, HW execution = low latency
Allows for rapid evaluation of different HW/SW
partitionings!
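As a rough illustration of black-box timing models, the Python sketch below charges each application event a latency that depends only on whether the operation is assigned to SW or HW. The operation names, the cycle counts in `LATENCY` and the function `elapsed_cycles` are hypothetical values invented for this example, not Sesame parameters.

```python
# Hypothetical latencies in cycles, keyed by (operation, implementation).
LATENCY = {
    ("DCT", "SW"): 4000, ("DCT", "HW"): 200,
    ("Q", "SW"): 800,    ("Q", "HW"): 50,
    ("VLE", "SW"): 1500, ("VLE", "HW"): 300,
}


def elapsed_cycles(trace, partitioning):
    """Sum the latency of every event in the trace for a given HW/SW partitioning."""
    return sum(LATENCY[(op, partitioning[op])] for op in trace)


trace = ["DCT", "Q", "VLE"] * 2  # the same application trace for every experiment
all_sw = {"DCT": "SW", "Q": "SW", "VLE": "SW"}
dct_in_hw = {"DCT": "HW", "Q": "SW", "VLE": "SW"}

sw_cycles = elapsed_cycles(trace, all_sw)
hw_cycles = elapsed_cycles(trace, dct_in_hw)
```

Changing the partitioning only changes a table lookup; the application trace stays untouched, which is what makes this kind of exploration cheap.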
Architecture modeling (cont’d)
Models implemented in Pearl
• Object-based discrete event simulation language
Keeps track of virtual time
Provides simulation primitives
Inter-object communication via message-passing
Keeps track of simulation statistics
• “RISC-like” language: keep it simple and make the
common case fast
Lacks features not needed for architectural modeling
(e.g., no dynamic datastructures, dynamic object
creation, etc.)
• Result: high-performance modeling & simulation
High simulation speed and low modeling effort
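Pearl code itself is not shown in these slides. The Python sketch below only illustrates the listed ingredients (virtual time, message passing between objects, simple statistics) with a tiny event-queue kernel. The classes `Simulator`, `Memory` and `Processor` and all latencies are assumptions made for this example.

```python
import heapq


class Simulator:
    """Keeps virtual time and delivers messages in timestamp order."""

    def __init__(self):
        self.now = 0
        self._events = []
        self._seq = 0  # tie-breaker so objects are never compared

    def send(self, delay, target, message):
        heapq.heappush(self._events, (self.now + delay, self._seq, target, message))
        self._seq += 1

    def run(self):
        while self._events:
            self.now, _, target, message = heapq.heappop(self._events)
            target.receive(self, message)


class Memory:
    def __init__(self, latency):
        self.latency = latency
        self.accesses = 0  # simulation statistic

    def receive(self, sim, requester):
        self.accesses += 1
        sim.send(self.latency, requester, "done")


class Processor:
    def __init__(self, memory, loads):
        self.memory = memory
        self.remaining = loads
        self.finished_at = None

    def start(self, sim):
        sim.send(0, self.memory, self)

    def receive(self, sim, message):
        self.remaining -= 1
        if self.remaining > 0:
            sim.send(1, self.memory, self)  # one cycle to issue the next load
        else:
            self.finished_at = sim.now


def demo():
    sim = Simulator()
    mem = Memory(latency=10)
    cpu = Processor(mem, loads=3)
    cpu.start(sim)
    sim.run()
    return cpu.finished_at, mem.accesses
```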
Architecture modeling (cont’d)
Models implemented in SystemC
We added a layer on top of SystemC 2.0, called
SCPEx (SystemC Pearl Extension)
• Provides SystemC with Pearl’s message-passing semantics
• Raises abstraction level of SystemC (e.g., no ports,
transparent incorporation of synchronization)
• Improves transaction-level modeling
SCPEx enables reuse of Pearl models in SystemC
context
• Makes Pearl-to-SystemC translation trivial
• Provides link towards possible implementation
• Facilitates importing SystemC IP models in Sesame
Sesame in layers
[Figure: three layers. Application model: Kahn processes that emit event traces. Mapping layer: one virtual processor per Kahn process, with buffers. Architecture model: Processor 1 and Processor 2 connected by a bus to Mem.]
Sesame’s mapping layer
Maps application tasks (event traces)
to architecture model components
Guarantees deadlock-free scheduling
of application events
Scheduling of communication events
Because Read events are blocking (Kahn),
some schedules may yield deadlock
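[Figure: an application model with elements labelled A, B and C and the trace events Read(C), Write(C), Write(A) and Read(B), mapped onto an architecture model with two processor cores connected by a bus.]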
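To make the deadlock risk concrete, the Python sketch below maps two event traces onto one processor core. A fixed dispatch order can stall forever on a blocking Read, while a scheduler that only dispatches events that can proceed finishes. This is an illustrative assumption about how such a check could look, not Sesame's actual scheduling algorithm; the task names and both functions are hypothetical.

```python
from collections import deque


def naive_schedule(traces, order):
    """Dispatch events strictly in `order`; return False if the core deadlocks."""
    tokens = {}
    pending = {task: deque(events) for task, events in traces.items()}
    for task in order:
        kind, chan = pending[task].popleft()
        if kind == "R":
            if tokens.get(chan, 0) == 0:
                return False  # blocking read on an empty channel stalls the core
            tokens[chan] -= 1
        elif kind == "W":
            tokens[chan] = tokens.get(chan, 0) + 1
    return True


def data_aware_schedule(traces):
    """Dispatch only events that can proceed; return the order, or None if stuck."""
    tokens = {}
    pending = {task: deque(events) for task, events in traces.items()}
    order = []
    while any(pending.values()):
        for task, events in pending.items():
            if not events:
                continue
            kind, chan = events[0]
            if kind == "R" and tokens.get(chan, 0) == 0:
                continue  # not ready yet, try another task
            events.popleft()
            if kind == "R":
                tokens[chan] -= 1
            elif kind == "W":
                tokens[chan] = tokens.get(chan, 0) + 1
            order.append(task)
            break
        else:
            return None  # no task can make progress
    return order


traces = {
    "consumer": [("R", "c"), ("E", None)],
    "producer": [("E", None), ("W", "c")],
}
```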
Sesame’s mapping layer
Accounts for synchronization behavior
• Mapping layer executes in same time domain as
architecture model
Transforms application-level events into
primitives (events) for architecture model
• More on this later on...
Tool for auto-generation of mapping layer
Y-chart Modeling Language (YML)
Flexible and persistent description (XML) of
The structure of application and architecture
models (connecting library components)
• SCPEx also supports YML!
The mapping of appl. models onto arch. models
(i.e., the mapping layer)
YML embeds a scripting language within XML
Simplifies descriptions of complicated structures
Increases expressive power of components
• E.g., a parameterized complex interconnect component
modeling a network of arbitrary size
Increases reusability
• Re-use of components and structures
An illustrative case study: M-JPEG
Lossy, Motion-JPEG encoder
Accepts both RGB and YUV formats
Includes dynamic quality control by on-the-fly
adaptation of quantization and Huffman tables
[Figure: video stream (RGB or YUV), RGB to YUV conversion, video stream (YUV), JPEG encoding, M-JPEG encoded video stream; the JPEG encoding stage is annotated with the observed bitrate.]
The platform architecture
Bus-based shared memory multiprocessor
architecture
[Figure: platform components VIP, microprocessor (mP), DSP1, DSP2, VOP and Memory.]
M-JPEG case study (cont’d)
[Figure: the M-JPEG application (video stream in RGB or YUV, RGB to YUV conversion, JPEG encoding with observed bitrate, M-JPEG encoded video stream) is mapped onto the platform architecture (VIP, mP, DSP1, DSP2, VOP, Memory) for exploration.]
M-JPEG case study (cont’d)
• Kahn Process Network
• Functional behavior
[Figure: application model. A Kahn process network with the processes Video in, DMUX, RGB2YUV, DCT, Quantizer, VLE, Video out and Control. It takes a sequence of video frames and produces compressed video frames. Channel labels include data blocks, YUV blocks (4:1), Q blocks (4:1), bitstream packets, (H,V), {NLP,LP}, {(H,V),B,b}, OB and B.]
Event traces
• Library approach
• Timing behavior
[Figure: architecture model. The components mP, VIP, DSP1, DSP2 and VOP, with numbered input and output ports, connect through BUS(B1) and FIFOs to MEMORY. The memory holds image buffers 1 to N (lines 1 to 8 each), a header buffer, a tables buffer, DCT -> Q buffers, Q -> VLE buffers, packet buffers and a statistics buffer.]
M-JPEG design space exploration
Experimented with different
HW/SW partitionings
Application-architecture mappings
Processor speeds
Interconnect structures (bus, crossbar and Ω networks)
This took about 1 person-month (all modeling
included)
Simulation performance: for 128x128 frames, a
270 MHz Sun Ultra 5 Sparcstation simulated
2.3 frames/second (= 0.43 secs/frame)
Mapping problem: implementation gap
[Figure: the application behavioral model (what?) and the architecture model (how?) each expose primitive operations; exploration/refinement of the architecture model narrows the gap towards the implementation.]
Mapping problem
Application events: Read, Write and Execute
Typical mismatch between application events and
architecture primitives, examples:
• Architecture primitives operating on different data
granularities
• Architecture primitives more refined than application
events
Trace events from the application layer need to
be refined
How?
• Refine the application model
• A transformation mechanism between the application
and architecture models
Communication refinement
Let’s take the mismatch of communication
primitives as an example
Assume following architecture communication
primitives
Synchronization primitives
Check-Data (CD)
Signal-Room (SR)
Check-Room (CR)
Signal-Data (SD)
Data movement primitives
Load-Data (Ld)
Store-Data (St)
Communication refinement (cont’d)
Transformation rules for refining application-level communication events [Lieverse]
R → CD Ld SR (1)
W → CR St SD (2)
E → E (3)
How to transform traces of application events
using (1), (2) and (3)?
Process A
while (1) {
  compute();
  write(block);
}
Process B (generates REW event sequences)
while (1) {
  read(block);
  compute();
  write(block);
}
Process C
while (1) {
  read(block);
  compute();
}
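Applied event by event, rules (1), (2) and (3) amount to a table lookup. The Python sketch below shows that straightforward reading; the names `RULES` and `refine` are illustrative and not taken from Sesame.

```python
# Rules (1)-(3): each application event maps to a list of architecture primitives.
RULES = {
    "R": ["CD", "Ld", "SR"],  # check data, load data, signal room
    "W": ["CR", "St", "SD"],  # check room, store data, signal data
    "E": ["E"],               # execute is passed through unchanged
}


def refine(trace):
    """Replace every application event by its architecture-level primitives."""
    return [primitive for event in trace for primitive in RULES[event]]


refined_b = refine(["R", "E", "W"])  # one iteration of process B
refined_a = refine(["E", "W"])       # one iteration of process A
```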
Communication refinement (cont’d)
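[Figure: processes A, B and C mapped onto Processors 1, 2 and 3, which are connected by a bus to Mem.]
• Assumption 1: processor 2 has local (block) memory
• Transforming REW event sequences from process B:
R E W → CD Ld SR E CR St SD
• Assumption 2: processor 2 has NO local (block) memory
• Transforming REW event sequences from process B:
R E W → CD CR Ld E St SR SD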
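The two orderings above can be reproduced with a small pattern-based rewrite. In this Python sketch the ordering for the no-local-memory case is copied from the slide rather than derived; the function name `refine_trace` and the `has_local_memory` flag are assumptions made for illustration.

```python
# Per-event rules, used when the processor has local (block) memory.
RULES = {"R": ["CD", "Ld", "SR"], "W": ["CR", "St", "SD"], "E": ["E"]}

# Ordering quoted on the slide for an R E W sequence without local memory.
NO_LOCAL_MEMORY_REW = ["CD", "CR", "Ld", "E", "St", "SR", "SD"]


def refine_trace(trace, has_local_memory):
    """Refine a trace, rewriting whole R E W windows when there is no local memory."""
    out = []
    i = 0
    while i < len(trace):
        if not has_local_memory and trace[i:i + 3] == ["R", "E", "W"]:
            out += NO_LOCAL_MEMORY_REW
            i += 3
        else:
            out += RULES[trace[i]]
            i += 1
    return out


with_memory = refine_trace(["R", "E", "W"], has_local_memory=True)
without_memory = refine_trace(["R", "E", "W"], has_local_memory=False)
```

In the second ordering CR comes before Ld and SR comes after St, so the rewrite cannot be expressed as an independent substitution per event. That is the kind of case the trace transformation mechanism has to handle.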
IDF-based trace transformation
Virtual processors in mapping layer are refined
to accomplish trace refinement
Integer-controlled DataFlow (IDF) model
describes internal behavior of a virtual processor
Application events specify
what a virtual processor executes
with whom it communicates
Internal IDF model specifies
how the computations and communications take
place at the architecture layer
IDF-based trace transformation (cont’d)
[Figure: three layers and their models of computation. Application model (process network): processes A, B and C. Mapping layer (dataflow): virtual processors X, Y and Z. Architecture model (discrete event): Processors 1, 2 and 3 connected by a bus to Mem.]
Communication refinement revisited
[Figure: processes A, B and C, their virtual processors X, Y and Z, and Processors 1, 2 and 3 connected by a bus to Mem.]
• Assumption: processor 2 has NO local (block) memory
• Transforming REW event sequences from process B:
R E W → CD CR Ld E St SR SD
Communication refinement revisited (2)
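[Figure: IDF graph inside virtual processor Y, between virtual processors X and Z. The event trace of process B (R E W) drives a switch that routes tokens to the actors CD, CR, Ld, E, St, SR and SD. Each actor X in {Ld, St, E} decomposes into X-init and X-exit, which send tokens to and receive tokens from processor 2 and the bus in the architecture model.]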
Computational refinement
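[Figure: processes A, B and C with virtual processors X and Z. The virtual processor of process B is refined so that the event sequence sent to Processor 2 becomes R E E E W. Processors 1, 2 and 3 are connected by a bus to Mem.]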
Putting Sesame to use:
An example design flow
[Figure: design flow linking three frameworks. Sesame: system-level architecture exploration. Compaan/Laura framework (Leiden University). Molen reconfigurable architecture framework (Delft University). Applications: Motion-JPEG encoder, DCT. Labels in the flow: architecture + simulation environment, proposed architecture, code suitable for FPGA execution, experimentation, detailed performance estimates.]
A real implementation using
Compaan/Laura/Molen
[Figure: the MOLEN prototype, showing main memory, instruction fetch, data load/store, arbiter, data memory MUX/DEMUX, core processor, register file and exchange registers.]
Mapping M-JPEG on the Molen
platform architecture
[Figure: the MJPEG code passes through the C++ compiler, while the DCT* kernel passes through Compaan and Laura to the reconfigurable processor (reconfigurable microcode unit and custom computing unit, CCU). The DCT* kernel is drawn as the stages IN, Preshift, 2D-DCT and OUT, exchanging pixels between in_block and out_block.]
The DCT* kernel
for k = 1:1:4,
  for j = 1:1:64,
    [Pixel(k,j)] = In(inBlock);
  end
end
for k = 1:1:4,
  if k <= 2,
    for j = 1:1:64,
      [Pixel(k,j)] = PreShift(Pixel(k,j));
    end
  end
  [Block] = 2D_dct(Pixel);
end
for k = 1:1:4,
  for j = 1:1:64,
    [outBlock] = Out(Pixel(k,j));
  end
end
System-level simulation experiment
Modeling Molen with DCT mapped onto CCU
Validation against real implementation
Information from Compaan/Laura/Molen used
for calibration of architecture model
Apply architecture model refinement
Keep M-JPEG application model untouched
DCT component in architecture model is refined
• Operates at pixel level
• Abstract pipeline model, deeply pipelined
Other architecture components operate at (pixel)block level
Sesame’s IDF-based model refinement
[Figure: the layered model (processes A, B, C; virtual processors; Processors 1, 2, 3 with bus and Mem) applied to the case study. Application model: M-JPEG. Mapping layer: map DCT on CCU and refine. Architecture model: Molen.]
[Figure: IDF graph of the DCT virtual processor. Inputs are an event trace and a control trace (11..11,11..11,00..00,00..00). Between repeat-begin and repeat-end the graph contains the actors cd/cr, cd, cr, ld, in, preshift (inside case-begin and case-end), 2d-dct, out, st and sr/sd, with token rates of 1, 4 and 64 on the edges. A scheduler, latency and throughput annotations, and the ports P1 and P2 (type in, block in, block out) connect the graph to and from the architecture model.]
Simulation results
Full software implementation
Simulation: 85024000 cycles
Real Molen: 84581250 cycles
Error: 0.5%
DCT mapped onto CCU
Simulation: 40107869 cycles
Real Molen: 39369970 cycles
Error: 1.9%
No tuning was done!
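The quoted error percentages follow from the cycle counts if the error is taken relative to the real Molen measurement, which is an assumption about how the slide computed it. A short Python check:

```python
def relative_error_percent(simulated, measured):
    """Absolute difference as a percentage of the measured value."""
    return abs(simulated - measured) / measured * 100


full_sw_error = relative_error_percent(85024000, 84581250)
dct_on_ccu_error = relative_error_percent(40107869, 39369970)
```

Rounded to one decimal place these give 0.5 and 1.9, matching the figures on the slide.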
Where are we going?
Some ongoing and future work
NoC modeling
So far, we mainly modeled bus-based systems
Networks-on-Chip (NoC) will be our (near) future
Standardized interfaces
Scalable (point-to-point) networks
Much more complex protocols (protocol stack?)
QoS aspects
Modeling NoCs
Topologies, switching & routing methods, flow control, protocols, QoS, etc.
Communication mapping
Modeling at multiple abstraction levels
• Gradual refinement
• Role of IDF models
Architecture model calibration
Initial derivation of latency parameters:
• documentation
• educated guess
• performance budgeting (what is the required parameter range?)
Op   Latency
x    50
y    100
z    10
Next step: calibration with lower-level, external
simulation models or prototypes, e.g.
• Instruction set simulators (ISSs)
• Compaan/Laura framework
[Figure: Kahn processes, virtual processors and buffers above Processor 1 and Processor 2, which are connected by a bus to Mem.]
Mixed-level system simulation
“Zoom in” on interesting system components in
architecture model
Simulate these components at a lower level
Retain high abstraction level for other components
Saves modeling effort
May save simulation overhead
Integration of external simulation models
ISSs, SystemC models, etc.
Also allows calibration of higher-level models
BUT…
Mixed-level simulation can be complex!
multiple time domains and time grain sizes (synchronization)
differences in protocol and data granularity of components
Mixed-level system simulation (cont’d)
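[Figure: application processes A, B, C and D; IDF-based refinement in the mapping layer produces A’, B’, C’ and C’’ with a scheduler; the architecture model contains P1, P2, P3 and Mem, with embedded external models (an ISS and a SystemC model).]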
Towards real design space exploration
Sesame supplies basic methods & tools for
evaluating application, architecture, and mapping
combinations
Simulating entire design space is not an option
More is needed to explore large design spaces
What will be the initial design(s) to evaluate?
How to react when the evaluated architecture does
not suffice?
We need steering before and during simulation
Design decisions using analytical modeling
• Finding Pareto-optimal candidates using multi-objective
optimization
Design evaluation using simulation
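As a small illustration of what "Pareto-optimal candidates" means, the Python sketch below filters a handful of design points by brute force. It is not the multi-objective optimization method used in Sesame; the candidate names and their (latency, power, cost) values are hypothetical, and all three objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if design a is no worse than b in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return {
        name: objectives
        for name, objectives in candidates.items()
        if not any(
            dominates(other, objectives)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }


# Hypothetical (latency, power, cost) tuples; lower is better for each.
candidates = {
    "all_sw": (100, 10, 5),
    "dct_hw": (40, 12, 8),
    "all_hw": (30, 25, 20),
    "slow_and_costly": (110, 12, 9),
}

front = pareto_front(candidates)
```

Only the points on the front would be passed on to simulation; a brute-force filter like this is quadratic in the number of candidates, so large design spaces need the heuristic search the slides refer to.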
Real design space exploration (cont’d)
[Figure: the Y-chart (application models, mapping, architecture model, performance analysis, performance numbers) extended with analytical application models and an analytical architecture model. Multi-objective optimization, using heuristic methods like evolutionary algorithms and taking into account performance, power and cost, produces candidate system architectures.]
Credits
This work would not have been possible without
the (ground-laying work of the) following people:
Paul Lieverse
Bart Kienhuis
Ed Deprettere
Pieter van der Wolf
Kees Vissers
Vladimir Zivkovic
Todor Stefanov
Cagkan Erbas
Simon Polstra
Berry van Halderen
Joseph Coffland
Frank Terpstra
Mark Thompson
For more information
URL:
www.science.uva.nl/~andy/publications.html
or
email: [email protected]
Sesame software can be found at:
sesamesim.sourceforge.net