Memory Oriented System-level Optimizations
for Scripting Enabled Embedded Systems
Jiwon Hahn
PhD Qualifying Exam
University of California, Irvine
March 2006
Motivation
▶ Embedded system development
 Growing challenges
   Increasing end-user expectations: more functionality, higher performance, cheaper, smaller
   Very short time-to-market
 Wide gap between available techniques and user satisfaction
 Need new tools and methodology!
[Figure: eco node and example applications: motion sensing, physiological sensing, structural health monitoring, preterm infant monitoring]
Jiwon Hahn, UC Irvine
2
Strategies
 Speed up the development!
   Need better programming/debugging methodology and tools
 Improve the current system’s bottleneck!
   The memory unit is one of the most costly components, and it affects the system’s performance, power, and overall application range
 Maximize the system’s capability!
   Since embedded systems are resource constrained, it helps to offload part of the system workload to the host
About My Research
 Framework
   Enhanced programming/debugging methodology
   Host-assisting runtime environment
 Optimization
   Reducing data memory requirements and increasing memory utilization
   Power and performance co-optimization
Outline
 Scripting Framework
 Memory-oriented Optimization
 Implementation
 Experimental Platforms
 Summary & Research Plan
Outline
▶ Scripting Framework
⊳ Scripting Engine Synthesis
⊳ Runtime Environment
⊳ Preliminary Results
 Memory-oriented Optimization
 Implementation
 Experimental Platforms
 Summary & Research Plan
Motivating Example
▶ Building a small embedded system
 Application
   sense temperature,
   send to the host every 5 min.
 Hardware
   temperature sensor
   Solder RF module
 Platform
   TecO particle: 17 x 35 mm, PIC18LF452 at 20 MHz, 32KB program Flash, 1.5KB RAM, 32KB external EEPROM, temperature sensor, RF interface, etc.
 Software (or Firmware)
   no OS support!
   no interactivity
   no partial testing
 Development steps (repeated):
1. Write the FW (C/assembly)
2. Compile
3. Connect board to the host
4. Enter the bootloading mode
5. Erase/Load/Verify Program
6. Restart the board
7. Run
Motivation
▶ Alternative approach: Scripting!
 Environment Setup (Scripting Engine Synthesis)
1. Generate the FW (Scripting engine synthesis)
2. Compile
3. Connect board to the host
4. Enter the bootloading mode
5. Erase/Load/Verify Program
6. Restart the board
7. Run
 Scripting (Runtime), repeated:
1. Write the script
2. Connect board to the host
3. Load & Run
Motivation
▶ Scripting vs. Traditional Programming
Aspects | Traditional | Scripting
Language | C, Assembly (less human readable) | Python, Tcl, Perl, … (higher level)
System Query | No interactivity; need oscilloscope, multimeter to check the status | Instant feedback
System Update | Recompile, reboot required | On-the-fly
Code Size | 5x~10x more lines | Shorter [J. Ousterhout ’98]
Performance Overhead | None | Scripting engine dependent (could be none or less)
Related Work
▶ Frameworks for runtime support
Name | high level (language) | interactivity | reconfigurability | kernel synthesis | hetero. sys. | code size
SOS | no (C) | no | yes* | yes | yes | 20K
Mate | no (asm-like) | no | yes* | no | no | 39K
TinyOS | no (nesC) | no | yes | yes* | no | 18K
Agilla | no (asm-like) | yes | yes* | no | no | 55K
Pushpin | no (C-subset) | no | yes* | no (berthaOS) | no | 34K
Sensorware | yes* (Tcl) | yes | yes* | no | no | >237K
Actornet | yes* (S-expression) | N/A | yes | no | no | <128K
VM* | yes (java) | no | yes* | yes | N/A | 25K
Our work | yes (python-like) | yes | yes | yes | yes | <17K
Our Framework: Rappit
▶ Overview
 Framework to provide the user an integrated scripting environment of the host and target systems
[Figure: the host (Rappit S/W, application script) is connected over a wired/wireless link to the target system (Rappit F/W, device drivers, H/W device). The user types ">> readTemperature()" on the host; the target receives packets, interprets the command, executes primitives (e.g., ADC read), and returns the result (52). A C firmware listing is shown for contrast with the one-line script.]
Rappit
▶ Scripting engine synthesis
[Figure: synthesis flow. Inputs: System Description (Architecture, Interactive Language, Application, Communication) and Component Library. Code Synthesis produces a compatible pair: Host S/W (Parser, MsgGen, GUI, …) and Target F/W (Scripting Engine, Primitives, …), compiled into a binary executable for the target system.]

# example: pin mapping for an RF module
mcu = MCU(ATmega169)      # instantiate an atmega169 MCU
import RF                 # load a transceiver module
rf = RF(nRF2401)          # instantiate nRF2401
rf.CS = mcu.PORTB[0]      # connect the chip select pin
rf.CE = mcu.PORTB[1]      # connect the chip enable pin
rf.DR1 = mcu.PORTB[2]     # connect the data ready pin
rf.CLK1 = mcu.PORTF[1]    # connect the clock pin
rf.DOUT1 = mcu.PORTF[2]   # connect the data pin

# example: packet format (Message format)
c_format = src(1),dst(1),msgID(1),opcode(1),arg(3),crc(1)
r_format = src(1),dst(1),msgID(1),mtype(1),dtype(1),\
           data(v), crc(1),eop(1)

// part of Scripting engine
switch (opcode)
{
  case 0x00:
    val = ADC_read();
  case 0x01:
    RF_send(val);
  case 0x02:
    RF_packetize(val);
  …
}

// part of primitives
char ADC_read(void)
{
  …
}
void RF_send(char pck)
{
  …
}
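As a minimal sketch of how a host-side message generator might serialize a command according to c_format (src(1), dst(1), msgID(1), opcode(1), arg(3), crc(1)), the Python below packs the fields into 8 bytes. The names pack_command and xor_checksum, the example field values, and the XOR checksum standing in for the CRC are assumptions made for illustration; the slides do not specify the CRC algorithm or the byte values.

```python
def xor_checksum(payload: bytes) -> int:
    """Placeholder for the 1-byte crc field (assumption: the real CRC is not given)."""
    crc = 0
    for b in payload:
        crc ^= b
    return crc


def pack_command(src: int, dst: int, msg_id: int, opcode: int, args: bytes = b"") -> bytes:
    """Lay out src(1), dst(1), msgID(1), opcode(1), arg(3), crc(1) as 8 bytes."""
    if len(args) > 3:
        raise ValueError("arg field holds at most 3 bytes")
    # Pad the argument field with zero bytes up to its fixed width of 3.
    body = bytes([src, dst, msg_id, opcode]) + args.ljust(3, b"\x00")
    return body + bytes([xor_checksum(body)])


# Hypothetical example: host 0x00 sends opcode 0x00 (ADC_read in the switch above) to node 0x01.
packet = pack_command(src=0x00, dst=0x01, msg_id=0x07, opcode=0x00)
print(packet.hex())
```

A fixed-width layout like this keeps target-side parsing down to indexing into the receive buffer, which fits the "easy to parse at node" goal.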
Rappit
▶ Runtime environment
[Figure: runtime architecture. Host (host-assisting modules): GUI, Parser, Optimizer, Msg Generator, Component Library, Pcktzer/Depcktzer, Packet Manager. Target System: Pck Buffer, Pcktzer/Dispatcher, Admission Controller, Scripting Engine, Native Routines. Commands flow from host to target; responses flow back.]
Rappit
▶ Host assistance
 Script Parsing (Parser)
   Input: “readTemp()”
    • User-friendly syntax
   Host Parser, Msg. generator
   Output to target node: “0x4A 0x01”
    • Easy to parse at node
    • Compact and efficient representation
 Memory Management (Optimizer)
   Input: raw script
    • Written by user
   Script Scheduler, Buffer Mapper
   Output to target node: optimized script
    • Minimal script size
    • Minimized memory usage
    • Minimized runtime overhead (fixed schedule and buffer usage)
Rappit
▶ Scripting examples
 Interactive port-setting
>> PORTA[1] = 1 # set port A pin 1
>> PORTA[2] = 1
>> PORTA[2] = 0 # toggle clock
>> PORTA[0] # read input pin
0
>> PORTA[2] = 1
>> PORTA[2] = 0 # toggle clock
>> PORTA[0] # read input pin
1
 System configuration
>> mcu.sysclock = 1 MHz
>> uart.baudrate = 9600 bps
>> rf.power = -5 db
>> rf.speed = 1 Mbps
>> rf.config # query
{’payload’: 1, ’power’: -5,
’speed’: 1000000,
’channel’: 100, ’mode’: ’TX’}
 Periodic-task scheduling
>> s = (every 50 ms: sample())
>> s.start()
>> s.stop()
Rappit
▶ Experimental platform
 AVR Butterfly Board
   Atmel ATmega169
   8-bit MCU @ 8MHz, 512B EEPROM, 1KB SRAM, 16KB program flash
   Includes dataflash, speaker, sensors, joystick, LCD
   USART serial link at 9600 baud
[Figures: AVR Butterfly; AVR Butterfly w/ wireless module]
Rappit
▶ Experimenting metrics and modality
 Observation Metrics
Metric | Unit
Code size | Bytes
Execution Speed | Cmds/sec
 Execution Modality
Modality | Approach | Programming Method
Native | Compiled | Program the firmware onto the Flash
Batch | Scripting | Preload a script program onto the RAM
Interactive | Scripting | Send one line of command to the RAM
Rappit
▶ Preliminary results
 Code size reduction
   61.8% to 66.3% reduction
   The scripting engine consists of a thin layer
   Most of the reduction is in application code size
 Performance overhead
   Batch mode scripting can be faster than native!
   Observed up to 25.7% speed-up
Outline
 Scripting Framework
▶ Memory-oriented Optimization
⊳ Memory Optimization
⊳ Multi-metric Optimization
 Implementation
 Experimental Platforms
 Summary & Research Plan
Motivating Example
▶ Installing Rappit primitives on Butterfly
 Problem Arise
   Choose primitives
     ADC_read, RF_send, RF_read, SD_write, SD_read, …
   Compile & Install
   Runtime Error!
   Why?
 Problem Analysis
   exceeded 1KB RAM usage
static unsigned char sd_buffer[512];
static unsigned char rf_buffer[30];
static unsigned char ADC_buffer[30];
char error_msg1[] = "No SD Card detected!";
char error_msg2[] = "Card Read Error!";
…
[Figure: 1KB SRAM layout with .data/.bss (SD_buffer 512B, RF_buffer, ADC_buffer, static strings), heap, and stack]
 Solution
   Sharing memory space
   Mapping static data to dataflash
[Figure: 1KB SRAM layout after memory sharing: Shared_buffer (600B ?), heap, and stack; static strings mapped to dataflash]
 Result
   Increased board capability
   Increased application range
Data Memory Minimization
▶ Assumptions and Approach
 Assumptions
   Optimizing scripts
     script size ↔ buffer size
   Optimizing at runtime
     Need low complexity algorithm
 Approach
   High-level optimization
   Using scheduling and buffer mapping techniques
   Priority on data memory minimization
   Based on model of computation (MoC)
Models of Computation (MoC)
 Synchronous Dataflow (SDF) [E. Lee ’87]
   Extensively used as specification for block-diagram based programming environments for signal processing
   Special case of dataflow
   No notion of time
   The number of tokens (=data) consumed and produced by each actor (=node) during each firing (=invocation) cycle is statically fixed.
 Fractional Rate Dataflow (FRDF) [H. Oh, S. Ha ’02]
   Extension of SDF that allows fractional flow of I/O samples of the original SDF
Why SDF?
 Formal representation for optimization, simulation
and analysis
 System-level optimization
 Application flow of various primitives
 Static scheduling
 Minimize runtime overhead for resource constrained
embedded systems
 Deadlock detection
 Bounding the memory requirements
 Good match for sensor applications
 collect data, process, transmit
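SDF
▶ Notations
 SDF graph G = (V, E, p, c)
   V: {v1, v2, … v|V|}
   E: {e1, e2, … e|E|}
   src(e): source node
   snk(e): sink node
   p(e): produce rate
   c(e): consume rate
   Each edge is drawn as src(e1) p(e1) → c(e1) snk(e1)
 T(e,v): topology matrix
   p(e) if v = src(e),
   -c(e) if v = snk(e)
   0 otherwise
[Figure: chain v1 → v2 → v3 → … → v|V| with edge e1 (produce 1, consume 2), e2 (produce 2, consume 1), e3 (produce 3, …), …, e|E| (…, consume 5)]
 Topology matrix of the example (rows e1 … e|E|, columns v1 … v|V|):
$$
T = \begin{pmatrix}
1 & -2 & 0 & \cdots & 0 \\
0 & 2 & -1 & \cdots & 0 \\
0 & 0 & 3 & \cdots & \\
\vdots & & & \ddots & \\
0 & 0 & 0 & \cdots & -5
\end{pmatrix}
$$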
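The definition of T(e,v) can be turned into a few lines of Python. The sketch below builds the topology matrix from an edge list and solves the balance condition (each edge's produced tokens equal its consumed tokens over one iteration) for the smallest integer firing counts. The function names and the edge-tuple layout (src, snk, p, c) are assumptions made for this illustration, and the solver assumes a connected graph.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def topology_matrix(actors, edges):
    """Rows are edges, columns are actors: +p(e) at src(e), -c(e) at snk(e), 0 otherwise."""
    col = {v: j for j, v in enumerate(actors)}
    matrix = [[0] * len(actors) for _ in edges]
    for i, (src, snk, p, c) in enumerate(edges):
        matrix[i][col[src]] += p
        matrix[i][col[snk]] -= c
    return matrix


def repetition_vector(actors, edges):
    """Smallest positive integer firing counts that balance every edge, or None."""
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:  # propagate rates along edges until nothing new is learned
        changed = False
        for src, snk, p, c in edges:
            if src in q and snk not in q:
                q[snk] = q[src] * p / c
                changed = True
            elif snk in q and src not in q:
                q[src] = q[snk] * c / p
                changed = True
    if len(q) != len(actors):
        return None  # graph is not connected
    if any(q[src] * p != q[snk] * c for src, snk, p, c in edges):
        return None  # inconsistent rates: no balanced iteration exists
    scale = reduce(lambda a, b: a * b // gcd(a, b), (f.denominator for f in q.values()), 1)
    return [int(q[v] * scale) for v in actors]


# Three-actor chain: A produces 2 per firing, B and C consume/produce 1.
chain = [("A", "B", 2, 1), ("B", "C", 1, 1)]
print(topology_matrix(["A", "B", "C"], chain))
print(repetition_vector(["A", "B", "C"], chain))
```

For the chain used here, one firing of A must be matched by two firings each of B and C.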
SDF
▶ Example
 Surge Application
[Figure: SDF graph A (ADC read) → x → B (RF pack) → y → C (RF send), with produce and consume rates of 1 on both edges]
   Actors: A, B, C
   Buffers: x, y
   Schedule: ABC
   Rappit Script (4L):
every 2048:
    x = ADC.read()
    y = RF.pack(x)
    RF.send(y)
SDF
▶ Example (cont’d)
 Same code in Java (20L) [J. Koshy ’05]:
SurgePacket sgPkt;
char eList, eVector;
byte sHandle;
sgPkt = new SurgePacket();
eList = Select.setEventId( eList, Events.TIMEOUT | Events.RADIO_RECV );
sHandle = Select.requestSelectHandle();
char val;
Clock.startTimeout( 2048 );
while (true) {
    eVector = Select.select(sHandle, eList);
    if (Select.eventOccurred( eVector, Events.TIMEOUT )) {
        val = PhotoSensor.sense();
        sgPkt.setReading( val );
        Surge.sendPacket( sgPkt );
        Clock.startTimeout( 2048 );
    }
    else if (Select.eventOccurred( eVector, Events.RADIO_RECV )) {
        handleRadioEvent( sgPkt ); // if base, forward to uart
    }
}
Problem Statements
1. Find the best schedule and buffer mapping that minimizes the buffer size requirement
   Goal-oriented
   Previous work
2. Find the best schedule and buffer mapping that fits into, and maximizes the utilization of, a given memory size
   Constraint-driven
   Novel
   Practical
Buffer Mapping Problem
▶ Spatial representation
 Token-lifetime chart (t-chart)
   row: token’s lifetime, produced → placed → consumed
   column: fixed number of token changes caused by firing event
[Figure: t-chart over the firings A, B, B, C, C (time axis); rows are the local buffers x (tokens t1, t2) and y (tokens t3, t4)]
Buffer Mapping Problem
▶ Spatial representation (cont’d)
 Memory-usage profile (m-profile)
[Figure: m-profile, memory versus time over the firings A, B, B, C, C]
 Metrics
   Msize = 4, Mtotal = 20, Mused = 11, Mwasted = 9, Mutil = 55%
   T = 5
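The m-profile metrics are related by simple arithmetic. The relations used below (Mtotal = Msize × T, Mwasted = Mtotal − Mused, Mutil = Mused / Mtotal) are inferred from the numbers on the slide rather than stated there, so treat them as an assumption; the function name is hypothetical.

```python
def memory_metrics(m_size, m_used, t):
    """Derive the m-profile metrics from buffer size, used token slots, and schedule length."""
    m_total = m_size * t          # every memory cell exists for all T time steps
    m_wasted = m_total - m_used   # cell-steps that hold no live token
    m_util = m_used / m_total     # fraction of cell-steps holding a live token
    return {"M_total": m_total, "M_wasted": m_wasted, "M_util": m_util}


# Values from the m-profile above: Msize = 4, Mused = 11, T = 5.
print(memory_metrics(m_size=4, m_used=11, t=5))
```

With the slide's inputs this reproduces Mtotal = 20, Mwasted = 9, and a utilization of 55%.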
Related Work
▶ Data memory optimization based on MoC
Technique | Group | Idea
Optimal Scheduling | [Bhattacharyya et al] in Ptolemy Group | Buffer minimized by optimal scheduling, optimize each local buffer
Buffer sharing by lifetime analysis | [Bhattacharyya et al] in Ptolemy Group, [Ha et al] in PeaCE group, [Ritz et al] in Meyr Group | Local buffer lifetime is analyzed to share global buffers
Buffer merging | [Bhattacharyya et al] in Ptolemy Group | Input/output buffer is shared (finer grain than buffer sharing)
Model checking | [Geilen et al] in Eindhoven Univ. | Reduced the problem to a model-checking problem on the state-space of SDF graph
Etc. (MBRO, PAPS, MRSP, …) | [Govindarajan et al] in Gao Group, [Peperstraete et al], [Goddard et al], [Ade et al] in GRAPE group | Rate-optimal / Vectorization / Application to real-time systems / etc
Memory Optimization Techniques
1) *Scheduling w/ Unshared Buffer
2) *Buffer Sharing
3) *I/O Buffer Merging
4a) **Fractionizing
4b) Rate Selection (new)
5) Pipelining (new)
* Well established previous work
** Recently proposed
Memory Optimization Techniques
▶ 1) Scheduling with unshared buffer
[Figure: SDF graph A → x → B → y → C; A produces 2 tokens on x, B consumes 1 from x and produces 1 on y, C consumes 1 from y]
 Schedule 1: A B B C C
x = A()
repeat 2:
    y = B(x)
repeat 2:
    C(y)
   Expanded:
x[0..1] = A()
y[0] = B(x[0])
y[1] = B(x[1])
C(y[0])
C(y[1])
   Buffer requirement: |x| + |y| = 2 + 2 = 4
 Schedule 2: A B C B C
x = A()
repeat 2:
    y = B(x)
    C(y)
   Expanded:
x[0..1] = A()
y[0] = B(x[0])
C(y[0])
y[0] = B(x[1])
C(y[0])
   Buffer requirement: |x| + |y| = 2 + 1 = 3
 By efficient ordering of actors, buffer requirement is reduced!
 Each edge is directly mapped to its dedicated buffer space
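The two buffer requirements can be checked by replaying each schedule and recording the peak number of tokens on every edge. This is a minimal sketch under the unshared-buffer assumption (one dedicated buffer per edge); the function name and the edge dictionary layout are hypothetical.

```python
def buffer_requirement(edges, schedule):
    """Replay a schedule and return the peak token count per edge.

    edges: {buffer_name: (src, snk, produce_rate, consume_rate)}
    schedule: sequence of actor names, one entry per firing
    """
    tokens = {name: 0 for name in edges}
    peak = dict(tokens)
    for actor in schedule:
        # A firing first consumes from its input edges ...
        for name, (src, snk, p, c) in edges.items():
            if snk == actor:
                if tokens[name] < c:
                    raise ValueError(f"{actor} fired without enough tokens on {name}")
                tokens[name] -= c
        # ... then produces onto its output edges.
        for name, (src, snk, p, c) in edges.items():
            if src == actor:
                tokens[name] += p
                peak[name] = max(peak[name], tokens[name])
    return peak


graph = {"x": ("A", "B", 2, 1), "y": ("B", "C", 1, 1)}
for order in ("ABBCC", "ABCBC"):
    peaks = buffer_requirement(graph, order)
    print(order, peaks, "total:", sum(peaks.values()))
```

Interleaving B and C lets each y token be consumed before the next one is produced, which is why the second schedule needs one slot less.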
Memory Optimization Techniques
▶ Comparing 1), 2), 3)
[Figure: SDF graph A → x → B → y → C with rates 2, 1 on x and 1, 1 on y]
 Script (Schedule: A B B C C):
x = A()
repeat 2:
    y = B(x)
repeat 2:
    C(y)
 1) Unshared Buffer
x[0..1] = A()
y[0] = B(x[0])
y[1] = B(x[1])
C(y[0])
C(y[1])
   Buffer requirement: |x| + |y| = 2 + 2 = 4
 2) Shared Buffer: data tokens consumed… reuse the available space!
x[0..1] = A()
y[0] = B(x[0])
x[0] = B(x[1])
C(y[0])
C(x[0])
   Buffer requirement: |x| + |y| = 2 + 1 = 3
 3) Merged I/O Buffer: use the same space for the input/output, assuming the token is consumed before output is produced…
x[0..1] = A()
x[0] = B(x[0])
x[1] = B(x[1])
C(x[0])
C(x[1])
   Buffer requirement: |x| + |y| = 2 + 0 = 2
Memory Optimization Techniques
▶ Comparing 1), 2), 3) (cont’d)
Metric | 1) Unshared Buffer | 2) Shared Buffer | 3) Merged I/O Buffer
|x|+|y| | 4 | 3 | 2
Mtotal | 20 | 15 | 10
Mused | 11 | 11 | 9
Mwasted | 9 | 4 | 1
Mutil | 55% | 73% | 90%
[Figure: t-charts for the three mappings over the firings A, B, B, C, C (time axis), showing tokens t1 to t4 in the local buffers x and y]
Memory Optimization Techniques
▶ 4a) Fractionizing
 Idea:
[Figure: original graph w → A → x → B (rate labels 1, 3, 1; Schedule: A 3(B)) and fractionized graph w → A’ → x → B (rate labels 1/3, 1, 1; Schedule: 2(AB))]
   Don’t wait until A produces a big chunk of data
   Modify actor A to process only a fractional amount of the original data at a time
 Trade-off
   Local effect
     Possible time and energy overhead
     e.g., resource’s access time, packet overhead
   Global effect
     Reduced bottleneck: shorter processing interval of A
     Reduced buffer size: min|x|: 2 → 1
Memory Optimization Techniques
▶ 4b) Rate Selection
 Idea
   Generalize fractionizing
[Figure: graph w → A → x → B with rate ranges (1,3), (2,6), (4,4); Schedule1: 2(A)B, Schedule2: AB, Schedule3: 2(A)3(B)]
   Not only allow fractions but also multiples
   Rate is defined as a range, but is fixed before the schedule is finalized
   Each actor is modeled with timing and power functions with respect to the I/O range
 Benefits
   Combines the power of flexibility and static determinism
   Increases buffer reduction opportunity
 Challenge
   Need an efficient way to handle the considerably increased exploration space at runtime
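To make the enlarged exploration space concrete, the sketch below enumerates every produce rate in a range for a single edge A → x → B and reports the firing counts and peak buffer use for each choice. Reading (2,6) as A's produce-rate range and (4,4) as B's fixed consume rate is an assumption drawn from the three schedules listed on the slide; the function names are hypothetical, and the demand-driven firing order (fire B as soon as it has enough tokens) is one simple policy, not necessarily the one used in Rappit.

```python
from math import gcd


def repetitions(p, c):
    """Firings of producer and consumer per iteration so that q_a * p == q_b * c."""
    g = gcd(p, c)
    return c // g, p // g


def peak_tokens(p, c):
    """Peak tokens on the edge when the consumer fires as soon as it can."""
    q_a, q_b = repetitions(p, c)
    tokens = peak = 0
    while q_a or q_b:
        if q_b and tokens >= c:
            tokens -= c
            q_b -= 1
        else:
            tokens += p
            q_a -= 1
            peak = max(peak, tokens)
    return peak


consume = 4
for produce in range(2, 7):  # candidate produce rates 2..6
    q_a, q_b = repetitions(produce, consume)
    print(f"p={produce}: {q_a}(A) {q_b}(B), peak |x| = {peak_tokens(produce, consume)}")
```

The rates 2, 4, and 6 give the firing counts of Schedule1, Schedule2, and Schedule3 on the slide; the intermediate rates show how quickly the candidate set grows even for one edge.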
Memory Optimization Techniques
▶ 5) Pipelining
 Idea
 Allow multiple actor firing at once
 Benefits
 Reduced buffer requirement
 Higher memory utilization
 Increased throughput
 Challenges
 Need multiprocessors
 Need to resolve resource conflict
 Need to consider synchronization problem
Memory Optimization Techniques
▶ Comparing 1), 4), 5)
[Figure: SDF graphs and t-charts for 1) Unshared Buffer, 4) Fractionized / Rate Selected, and 5) Pipelined]
 Buffer Size: 33% reduction
 Utilization: 66.7% → 100%
 Time: 5 → 4 firing units
Memory Optimization Techniques
▶ Summary
Technique | 0 | 1 | 1+2 | 1+2+3 | 1+4 | 1+2+4 | 1+2+3+4 | 1+4+5
M_size | 4 | 3 | 3 | 2 | 2 | 2 | 1 | 2
M_used | 11 | 10 | 10 | 9 | 8 | 8 | 6 | 8
M_wasted | 9 | 5 | 5 | 1 | 4 | 4 | 0 | 0
T | 5 | 5 | 5 | 5 | 6 | 6 | 6 | 4
M_utilization | 55% | 66.7% | 66.7% | 90% | 66.7% | 66.7% | 100% | 100%
0: None (baseline), 1: Unshared Scheduling, 2: Shared Buffer, 3: Merged I/O, 4: Fractionized, 5: Pipelined
[Figure: t-charts of tokens t1 to t4 in a global buffer over the firings of A, B, C]
Multi-metric Optimization
 Trade-offs
   From the actor's point of view (local), processing a large amount of data at once tends to reduce time and energy overhead
   From the SDF-flow point of view (global), processing a small amount of data at once reduces the buffer requirement
 Goal
   Find a Pareto-optimal point within the set of solutions that satisfy the constraints
[Figure: energy, data memory, and execution time plotted against data-flow rate]
Applying it to Rappit
▶ Quasi-static optimization
Rappit Flow | Performed Tasks
Compile | Kernel and primitives compiled and installed
Load script | SDF defined
Preprocess (Optimization) | Actor-to-processor assignment, actor ordering (scheduling), buffer mapping
Load script code | Static schedule loaded
Execute | Deterministic execution w/o runtime overhead
[Figure annotations: the flow runs from compile-time to run-time and from Host to Target]
Outline
 Scripting Framework
 Memory-oriented Optimization
▶ Implementation
⊳ Synthesis Tool
⊳ Simulator
⊳ Runtime Host-assisting Tool (GUI)
 Experimental Platforms
 Summary & Research Plan
Implementation
▶ Scripting engine synthesis tool
 System Template
   GUI-based check-box approach
   easily capture existing systems
   model new systems for simulation and design space exploration
   includes communication description
 Component Library
   binds according to template configuration
   consists of MCU, on-chip devices, off-chip peripherals
   each component has I/O pins and driver modules
Implementation
▶ Memory simulator
Implementation
▶ Interactive runtime tool
Implementation
▶ Tool integration
[Figure: integrated host tool with GUI, Parser, Scheduler, Memory Optimizer, Dispatcher, and Node Manager, connected to Node 1, Node 2, Node 3, …, Node N]
Outline
 Scripting Framework
 Memory-oriented Optimization
 Implementation
▶ Experimental Platforms
 Summary & Research Plan
HW Platforms and Real-world Applications
 Eco
 ultra-compact sensor node
 pre-term infant monitoring
 dancing motion detection
 Mini-FDPM
 active laser sensing device
 breast cancer detection
 DuraNode
 real-time data acquisition system
 structural health monitoring
 Butterfly
 low-power, i/o rich development board
 prototyping (SD-card, speaker, sensors, RF)
Outline
 Scripting Framework
 Memory-oriented Optimization
 Implementation
 Experimental Platforms
▶ Summary & Research Plan
Summary
 A novel scripting framework for embedded systems
   Scripting engine synthesis
   Host-assisting runtime environment
 Memory optimization techniques
   Comparison of techniques
   Integration and multi-objective problem
 Tool Implementations
   Rappit GUI, memory simulator
Contributions
 Empowered Embedded Systems
   Unleashing the severely constrained embedded systems
 SDF Extensions
   Extension of SDF model
   Extending the application area of SDF
 Memory Savings
   Reduced memory requirement by integration of policies, including new techniques
Research Plan
▶ finished, ongoing, future work
 Framework
   [finished] Language definition*
   [finished] Initial implementation and prototyping
   [finished] Component library generation*
   [ongoing] Code generation
   [ongoing] Overhead analysis
   [ongoing] Tool integration
   [future] Test on multinode scenario
 Optimization
   [finished] Survey and comparison
   [finished] Simulator implementation
   [ongoing] Integrating techniques
   [ongoing] SDF extension on rate
   [ongoing] Rate-selection algorithm
   [future] Buffer-mapping protocol
   [future] Cost function modeling of multi-metric optimization
   [future] SDF extension on timing
 Case Study
   [finished] AVR butterfly
   [ongoing] mini-FDPM
   [future] eco
   [future] DuraNode
*with Qiang Xie & Jinfeng Liu
Publications
 Jiwon Hahn, Qiang Xie, and Pai H. Chou, Rappit: A Framework for the Synthesis of Host-Assisted Light-Weight Scripting Engines for Adaptive Embedded Systems, in Proc. International Conference on Hardware Software Codesign and System Synthesis (CODES+ISSS), 2005.
 Jiwon Hahn, Dexin Li, Qiang Xie, Pai H. Chou, Nader Bagherzadeh, David W. Jensen, Alan C. Tribble, Power Reduction in JTRS Radios with ImpacctPro, in Proc. IEEE Military Communication Conference (MILCOM), 2004.
Bibliography
 Murthy PK, Shuvra S. Bhattacharyya, Buffer merging - a powerful technique for reducing memory requirements of synchronous dataflow specifications, ACM Transactions on Design Automation of Electronic Systems (TODAES), 2004.
 Murthy PK, Shuvra S. Bhattacharyya, Shared buffer implementations of signal processing systems using lifetime analysis techniques, IEEE Transactions on Computer-Aided Design of Integrated Circuits & Systems (TCADICS), 2001.
 Shuvra S. Bhattacharyya, Murthy PK, Edward A. Lee, APGAN and RPMC: Complementary Heuristics for Translating DSP Block Diagrams into Efficient Software Implementations, Design Automation for Embedded Systems (DAES), 1997.
 Shuvra S. Bhattacharyya, Murthy PK, Edward A. Lee, Joint Minimization of Code and Data for Synchronous Dataflow Programs, 1997.
 Hyunok Oh, Soonhoi Ha, Fractional rate dataflow model and efficient code synthesis for multimedia applications, SIGPLAN Not, 2002.
 Hyunok Oh, Soonhoi Ha, Data memory minimization by sharing large size buffers, Asia and South Pacific Design Automation Conference (ASPDAC), 2000.
 Hyunok Oh, Soonhoi Ha, Efficient Code synthesis from extended dataflow graphs for multimedia applications, Design Automation Conference (DAC), 2002.
 Geilen M, Basten T, Stuijk S, Minimising buffer requirements of synchronous dataflow graphs with model checking, 42nd Design Automation Conference (DAC), 2005.
 Eckart Zitzler, Jurgen Teich, and Shuvra S. Bhattacharyya, Multidimensional Exploration of Software Implementations for DSP Algorithms, Journal of VLSI Signal Processing (JVLSI), 1999.
 John K. Ousterhout, Scripting: Higher Level Programming for the 21st Century, IEEE Computer magazine, 1998.
 TecO Home, http://particle.teco.edu/
Acknowledgements
 This work is sponsored in part by the National
Science Foundation grant CCR-0205712 and
NSF CAREER Award CNS-0448668
 Professor Pai Chou
 Qiang Xie
 Jinfeng Liu
Backup Slides
Scripting Overhead
 Scripting for General Purpose Computers
   Assume unlimited resources
   Full feature scripting engine for convenience
   Slower than system programming language
 Scripting for Embedded Systems
   Limited memory, CPU, power, …
   Need scripting engine optimization
     Host assist
     Language subsetting
     Library subsetting
     Efficient memory usage
   Scripting may be even faster than compiled code!
Rappit
▶ Packet format example
 Command Packet Format
Dst. | Msg ID | Opcode | Input[3] | Output[3] | CRC
   Command Message Format: Opcode | In_addr | In_start | In_size | Out_addr | Out_start | Out_size
 Response Packet Format
Src. | Msg ID | Msg Type | Data Type | Payload | CRC | EOP
   Response Message Format
Rappit
▶ Scripting engine optimization in code synthesis
 Language subsetting
   e.g., assignment (=), loop (repeat)
 Library subsetting
   customized for target applications and platform
[Figure: full-featured component library (MCU, RF, Dataflash, SPI, UART, GPIO, Interrupts, ADC, LCD, Joystick, Sensor1, Sensor2) reduced to a subset (Interrupts, RF, GPIO, ADC, UART, Sensor1)]
Memory Organizations
▶ Comparing previous work and Rappit
 Previous approaches consider both data and code memory minimization, but prioritize code size*
 We mainly focus on data size** minimization
[Figure: memory maps. Previous work: on-chip Flash or EEPROM holds the application code*, RAM holds the buffer. Our work: on-chip Flash or EEPROM holds the Rappit kernel and primitives, RAM holds the script code and the buffer**, with additional Data Flash.]
Rappit
▶ Code size of runtime components
Host Code (.py) | Lines | Size (KB)
GUI | 644 | 21.8
Parser | 127 | 2.87
Msg Generator | 221 | 4.97
Library | 263 | 6.396
Packetizer & Depacketizer | 82 | 2.0
Packet Mgr | 42 | 0.92
Total | 1379 | 38.96

MCU Code (.c) | Lines | Size (KB)
Cmd Interpreter | 260 | -
Primitives | 90 | -
Packetizer & Depacketizer | 300 | -
Total | 750 | 1.484
Rappit
▶ Summary of results
 Code size reduction
Application | Native | Rappit | Reduction
Reg setting | 4.356 KB | 1.664 KB | 61.8%
LCD usage | 12.45 KB | 4.2 KB | 66.3%
 Performance overhead components analysis
Component | Native | Interactive | Batch
Communication | 1 | 3 | 1
RAM Access | 3 | 1 | 1
ROM Access | 3 | 1 | 1
Packetization | 1 | 2 | 2
Interpretation | 1 | 2 | 2
Total cmd/sec | 92 | 4.75 | 111
1: fast, 2: tolerable, 3: slow (bottleneck)
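The reduction column follows directly from the native and Rappit code sizes. The short check below is a sketch, assuming the reduction is defined as (native − Rappit) / native; the function name is hypothetical.

```python
def reduction_percent(native_kb, rappit_kb):
    """Relative code size reduction of the Rappit version versus the native one."""
    return 100.0 * (native_kb - rappit_kb) / native_kb


# Sizes taken from the code size reduction table above.
for app, native, rappit in [("Reg setting", 4.356, 1.664), ("LCD usage", 12.45, 4.2)]:
    print(f"{app}: {reduction_percent(native, rappit):.1f}%")
```

Both rows reproduce the percentages reported in the table.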
Rappit
▶ Subset of primitives
Device | Primitives
MCU | reset, power save, initialize, get sys clock, set sys clock
RF | INIT, set channel, set power, set frequency, send, receive
GPIO | set pin, get pin, clear pin
USART | TX, RX
SD | read, write
ADC | read
Sensor1 | read
Sensor2 | read
Sensor3 | read
Timer | register fcn, remove fcn
RTC | set clock, read clock
LCD | clear, write, set contrast
Joystick | get key
Speaker | set volume, play tone, play song
Rappit
▶ Language
key | Usage | Example
import | import methods of each device | from RF import *
doc, dict | look up documentation, included methods | RF.__doc__, RF.__dict__
open, close | open/close a connection to a target system | node1 = open(MCU1, uart1); node1.close()
ls | list all connected instances | ls
every, start, stop | schedule events with certain period | s1 = (every 30ms: a+= ADC1.read()); s1.start(); s1.stop()
repeat | looping | repeat 3: SD.write(a)
def | define a function with a series of methods | def readTemperature(): ...
=, + | assign/configure or add value | a = SD.read(10); a+=SD.read(20)
SDF
▶ Strength and limitations
 Strength
 Ability to express multi-rate systems, parallelism
 Deadlock detection and scheduling can be determined at
compile-time
 Bounded memory requirements
 No runtime supervisory overhead
 Limitations
 Lack of conditional control flow
 Does not model asynchronous nodes
 Does not adequately address the real-time nature of
connections to the outside world
 Does not address data-dependent run times
Superset of SDF
▶ Dynamic dataflow (DDF)
 Allows asynchronous actors with non-fixed rate of each actor
 Captures dynamic constructs
   if/else
   for-loop
   do/while loop
   recursion
SDF
▶ Notations
 Firing & Tokens
   $f(n)$: nth firing vector
   $tk(n)$: number of live tokens after nth firing
   $tk(n+1) = tk(n) + \Gamma \cdot f(n)$, where $\Gamma$ is the topology matrix $T(e,v)$
   $f = \sum_{n=0}^{T} f(n)$: firing frequency
   $q = f_{min}$: firing vector (minimum # of firings)
   $q(src(e_i)) \times p(src(e_i)) = q(snk(e_i)) \times c(snk(e_i))$ ← balance equation
 Consistent SDF
   $\mathrm{rank}(\Gamma) = |V| - 1$
   $\Gamma \cdot q = 0$
 Scheduling
   Given $\Gamma$, $tk(0)$, and $q$, find a firing order which satisfies $tk(n) \ge 0$ and $q = \sum_{n=0}^{T} f(n)$
   Deadlocked if no node can be fired before reaching $q = \sum_{n=0}^{T} f(n)$
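The scheduling and deadlock conditions can be sketched as a token simulation: keep firing any actor that still owes firings and has enough input tokens, and report a deadlock if none can fire before q is reached. The function name, the edge-tuple layout (src, snk, p, c), and the "first fireable actor" policy are assumptions for this illustration; the policy yields a valid schedule but not necessarily a memory-optimal one.

```python
def find_schedule(actors, edges, q, initial_tokens=None):
    """Return a firing order that completes q with tk(n) >= 0, or None on deadlock.

    edges: list of (src, snk, produce_rate, consume_rate)
    q: {actor: required number of firings}
    initial_tokens: tokens per edge at time 0 (tk(0)); defaults to all zero
    """
    tokens = list(initial_tokens or [0] * len(edges))
    remaining = dict(q)
    schedule = []
    while any(remaining.values()):
        for v in actors:
            if remaining[v] == 0:
                continue
            inputs = [i for i, e in enumerate(edges) if e[1] == v]
            if all(tokens[i] >= edges[i][3] for i in inputs):
                for i in inputs:
                    tokens[i] -= edges[i][3]      # consume c(e) from each input
                for i, e in enumerate(edges):
                    if e[0] == v:
                        tokens[i] += e[2]         # produce p(e) on each output
                remaining[v] -= 1
                schedule.append(v)
                break
        else:
            return None  # no actor can fire before reaching q: deadlock
    return schedule


chain = [("A", "B", 2, 1), ("B", "C", 1, 1)]
print(find_schedule(["A", "B", "C"], chain, {"A": 1, "B": 2, "C": 2}))

# A two-actor cycle deadlocks without an initial token on the feedback edge.
cycle = [("A", "B", 1, 1), ("B", "A", 1, 1)]
print(find_schedule(["A", "B"], cycle, {"A": 1, "B": 1}))
print(find_schedule(["A", "B"], cycle, {"A": 1, "B": 1}, initial_tokens=[0, 1]))
```

Because every edge has a single consumer, firing one actor never disables another, so a greedy search like this either completes q or proves that the graph is deadlocked.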
SDF
▶ Our extensions
 SDF was previously used in multimedia-oriented applications targeting DSPs and FPGAs
 To target more general types of applications, non-buffered edges (dummy channels), which only denote precedence, should be added
 The produce/consume rate of each actor is not given as fixed, but as a range
 Add timing (future work)
SDF
▶ Another example
 Extended Surge Application
[Figure: SDF graph with actors A (ADC read), B (SD store), C (LCD show), D (SD read), E (Kernel pack), F (RF send) and buffers a to f; rate labels include 1 and 10 between A and B, 1 and 3 between B and D, and 10 and 1 between D and E]
 Valid Schedules:
   30(A) 3(B) 3(C) D 10(E) 10(F) (Flat SAS)
   3 (10(A) BC) D 10(EF) (SAS)
   30(A) 2(BC) BCD 10(EF) (Non SAS)
SDF
▶ Another example (cont’d)
 Script (SAS)
enable Timer1, RF, SD, LCD
every 2048:
repeat 10:
repeat 10:
a = ADC.read()
LCD.show(a)
SD.store(a)
repeat 10:
b = SD.read()
repeat 3:
c = Kernel.pack(b)
RF.send(c)
Script-to-SDF Transform
 User script
x = A()
repeat 2:
    y = B(x)
    C(y)
 Derived SDF
V = { A, B, C }
E = { x, y } = {eAB, eBC}
πinit = A 2(BC)
eAB: p(A) = (2, 3), c(B) = (1, 1)
eBC: p(B) = (1, 1), c(C) = (1, 2)
[Figure: A → x → B → y → C with rate ranges 2/3 and 1/1 on x, and 1/1 and 1/2 on y]
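The structural part of this transform (extracting V and E from the script) can be sketched with a small line parser. This is an illustration only: the regular expression, the function name, and the restriction to single-argument calls are assumptions, and the rate ranges p and c are not derived here because they come from the component library rather than the script text.

```python
import re

# Matches "out = Actor(arg)", "out = Actor()", "Actor(arg)", or "Actor()".
CALL = re.compile(r"^(?:(\w+)\s*=\s*)?(\w+)\((\w*)\)$")


def script_to_sdf(script):
    """Return (actors, edges); each edge is (buffer, producer, consumer)."""
    actors, producer_of, edges = [], {}, []
    for line in script.splitlines():
        match = CALL.match(line.strip())
        if not match:
            continue  # skips control lines such as "repeat 2:" and blank lines
        out, actor, arg = match.groups()
        if actor not in actors:
            actors.append(actor)
        if arg:
            # The buffer named by the argument connects its producer to this actor.
            edges.append((arg, producer_of[arg], actor))
        if out:
            producer_of[out] = actor
    return actors, edges


user_script = """x = A()
repeat 2:
    y = B(x)
    C(y)"""
print(script_to_sdf(user_script))
```

For the user script above this recovers V = {A, B, C} and the two edges x (A to B) and y (B to C).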
Multimetric Optimization
▶ Cost function modeling
 Constraints
 Energy
 Battery lifetime or other source of power budget
 Time
 Deadline in given real-time application
 Memory
 Given memory size for a platform
 Each node is modeled with:
 Pv(c,p): power consumption w.r.t. consume/produce rate (i.e.,
input/output data size)
 Tv(c,p): execution delay w.r.t. consume/produce rate
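One way to read the cost-function model is as a two-step filter: discard candidate rate settings that violate the energy, time, or memory constraints, then keep only the non-dominated ones. The sketch below shows that filter; the function name, the dictionary layout, and every number in the candidate list are hypothetical placeholders, not measurements from this work.

```python
def pareto_front(candidates, limits):
    """Keep feasible candidates that no other feasible candidate dominates.

    candidates: list of dicts with one entry per metric in `limits`
    limits: {metric: maximum allowed value}; lower is better for every metric
    """
    feasible = [c for c in candidates if all(c[m] <= limits[m] for m in limits)]

    def dominates(a, b):
        no_worse = all(a[m] <= b[m] for m in limits)
        better = any(a[m] < b[m] for m in limits)
        return no_worse and better

    return [c for c in feasible if not any(dominates(o, c) for o in feasible)]


# Hypothetical per-rate costs: small rates save memory, large rates save energy and time.
candidates = [
    {"rate": 1, "energy": 9, "time": 9, "memory": 1},
    {"rate": 2, "energy": 6, "time": 6, "memory": 2},
    {"rate": 3, "energy": 7, "time": 7, "memory": 3},   # dominated by rate 2
    {"rate": 6, "energy": 4, "time": 4, "memory": 6},   # exceeds the memory limit
]
limits = {"energy": 10, "time": 10, "memory": 4}
print([c["rate"] for c in pareto_front(candidates, limits)])
```

In practice the energy and time entries would come from evaluating Pv(c,p) and Tv(c,p) for each candidate rate, and the memory entry from the buffer mapping.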
