STRUCTURED CODESIGN
FOR MANYCORE SYSTEMS
Jürg Gutknecht & Lisa (Ling) Liu, ETH Zürich
Sofsem Novy Smokovec, January 2011
About Me
1968 System programming at Swissair
1977 PhD in Mathematics
1981 Joined Niklaus Wirth's Lilith/Modula team
1985 Sabbatical stay at Xerox PARC
1986 Project Oberon together with Wirth
2000 Academic languages researcher at MSR
Outline of Talk
Context & Vision
A Structured Approach
Use Cases
Programming Language & Compiler
Power Management Codesign
Hardware Library
Context & Vision
Some context of the project and a vision
Microsoft Innovation Cluster
Launched in 2008 by Microsoft (Research)
Volume: 5 years / $5 million
Theme: embedded systems software
Participants
ETH Zürich (3 projects)
EPFL Lausanne (4 projects)
„Supercomputer in the Pocket“ is one among them
Goals
Research in embedded systems
Technology transfer
Education
Supercomputer in the Pocket
Manycore architecture for embedded systems on
the basis of programmable hardware (FPGA)
High-performance computing in the small
Generic technology for wide range of apps
Sensor-driven medical IT (the focus of this talk)
Data streaming in financial apps
Running robot with limb control
Real time audio processing
Hardware/ software design from the ground up
People Involved
Microsoft Research
Chuck Thacker (consultant)
ETH Zürich
Niklaus Wirth (processor design)
Jürg Gutknecht (project leader)
Lisa (Ling) Liu (hardware design)
Felix Friedrich (compiler)
University Hospital Basel
Alexej Morozow (medical IT app)
The Vision
Custom hardware design for embedded systems
Programmers need no hardware knowledge
System design process at high level of abstraction
Fully automated mapping process to FPGA
FPGA resources are used efficiently
Semantic Gap
Program Constructs
Object
Thread
Data structure
Statement
Communication
I/O
...
Map to
FPGA Resources
Lookup tables (LUT)
Block RAMs (BRAM)
DSP slices
…
A Structured Approach
Big picture of our structured codesign approach
Options for How to Achieve It
Hardware compilation: Custom mapping of specific
algorithm (or hot spots) to hardware circuits.
Uniprocessor: Single universal processor plus on-chip
cache memory. Transparently connected to external
memory.
SMP: Several universal processors, each with on-chip
cache memory, and each transparently connected to
external memory. Cache coherence mechanism needed.
Preconfigured: Several universal processors, each with
private on-chip memory. Interconnected via on-chip
network. One processor connected to external memory.
A Better Approach
Hardware/ software codesign based on a suitable
high-level computing model and programming
language
Fully automated mapping/ synthesizing to FPGA
hardware based on suitable library of highly
configurable hardware components
Our Computing Model
Active Cell (Actor)
Object with private state space
Behavior control thread
Communicating with other actors via channels
Actor Graph
Collection of interoperating actors running in parallel
Some actors connected to I/O via serial port
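The computing model can be imitated in ordinary software. The following is a minimal Python sketch under stated assumptions, not the project's implementation: an actor is a thread with private local state, a channel is a bounded queue, and the names `adder_actor` and `run_graph` are made up for illustration.

```python
import queue
import threading

def adder_actor(in1, in2, out, rounds):
    """Actor with private state (i, j) and a single behavior thread."""
    for _ in range(rounds):
        i = in1.get()      # blocking receive on channel in1
        j = in2.get()      # blocking receive on channel in2
        out.put(i + j)     # send the result on channel out

def run_graph(pairs):
    """Feed pairs into the actor and collect one result per pair."""
    # Channels are bounded FIFO queues (capacity 32 is an arbitrary choice).
    in1, in2, out = (queue.Queue(maxsize=32) for _ in range(3))
    worker = threading.Thread(target=adder_actor,
                              args=(in1, in2, out, len(pairs)))
    worker.start()
    results = []
    for a, b in pairs:
        in1.put(a)
        in2.put(b)
        results.append(out.get())
    worker.join()
    return results
```

The actor shares nothing with its environment except the three channels, which is the property the model relies on.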
Our Hardware Library
TRM processor (Tiny Register Machine)
Extremely simple
Two-level pipelined instruction execution
Several variants: VTRM (vectors via DSP), DTRM (DMA)
Communication FIFO
Ring buffer
Sizes 32, 64, 128, 1024
I/O controllers
DDR2, CF, LCD, UART
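For readers unfamiliar with ring buffers, here is a small software sketch of a fixed-capacity FIFO. It only illustrates the data structure; the hardware FIFO in the library is a circuit, and the class name `RingFifo` and its methods are hypothetical.

```python
class RingFifo:
    """Fixed-capacity FIFO on a circular buffer (software sketch only)."""

    def __init__(self, capacity=32):
        self.buf = [None] * capacity
        self.head = 0      # index of the oldest element
        self.count = 0     # number of stored elements

    def try_send(self, value):
        """Store a value; return False if the buffer is full."""
        if self.count == len(self.buf):
            return False
        tail = (self.head + self.count) % len(self.buf)
        self.buf[tail] = value
        self.count += 1
        return True

    def try_recv(self):
        """Return the oldest value, or None if the buffer is empty."""
        if self.count == 0:
            return None
        value = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return value
```

Because `try_recv` uses `None` to signal an empty buffer, this sketch is only suitable for payloads that are never `None`.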
Mapping
Actor Graph -> FPGA
Actor -> TRM processor („core“), instruction memory, data memory
Communication channel -> FIFO buffer
I/O -> I/O controllers connected to cores
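The mapping rule can be written down as a short routine. This is a sketch under assumptions: the function `map_to_fpga` and the tuple layout are invented for illustration, and the real tool emits Verilog rather than a Python list. The example graph is assembly A from the language part of the talk.

```python
def map_to_fpga(actors, channels, io_ports):
    """Return the hardware components needed for an actor graph.

    actors:   list of actor instance names
    channels: list of (sender_port, receiver_port) pairs
    io_ports: dict mapping an actor name to an I/O controller kind
    """
    components = []
    for name in actors:
        # one core with its own instruction and data memory per actor
        components.append(("TRM", name))
    for src, dst in channels:
        # one FIFO buffer per communication channel
        components.append(("FIFO", f"{src}->{dst}"))
    for name, kind in io_ports.items():
        # I/O controllers are attached to the core that uses them
        components.append((kind + " controller", name))
    return components

# Assembly A: a UI actor and an F actor connected by three channels.
plan = map_to_fpga(
    actors=["ifc", "f"],
    channels=[("ifc.out1", "f.in1"), ("ifc.out2", "f.in2"), ("f.out", "ifc.in")],
    io_ports={"ifc": "UART"},
)
```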
TRM/ FIFO Cooperation
channel -> FIFO -> recv -> TRM -> send -> FIFO -> channel
• fully orchestrated by TRM
• no interrupts!
Use Cases
Two data driven applications of our system
Realtime Multichannel ECG Monitor
Analyze the activity of the heart, the morphology of
the corresponding waves, and the heart rate
variability (HRV), with the aim of detecting and
classifying potential anomalies
The signal to be analyzed decomposes into 8
physical channels, each of them sampled at 500 Hz
Decomposition into Actor Graph
Signal input (ECG bitstream) -> Wave proc_1, Wave proc_2, ... Wave proc_8 -> QRS detect -> HRV analysis -> Disease classifier -> out stream
Actions
Receive ECG signal from UART, compose individual
samples, and distribute them to channel processors.
(Per channel): Precondition wave by suppressing noise
via linear filtering; Detect the heart beats and
contractions.
Detect QRS patterns and make a final decision about
heart rate on the basis of standard multichannel logic.
Analyze the current heart rhythm and the heart rate
variability (HRV).
Use decision tree logic to detect and classify arrhythmia
events such as premature ventricular contractions (PVC),
ventricular tachycardia etc. Feed results back to
configure wave processing.
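The first two actions can be sketched in a few lines. The sketch makes two assumptions that the slides do not state: samples arrive interleaved channel by channel, and the linear filter is a plain moving average. The function names are hypothetical.

```python
def distribute(samples, n_channels=8):
    """Split an interleaved sample stream into per-channel lists."""
    channels = [[] for _ in range(n_channels)]
    for index, sample in enumerate(samples):
        channels[index % n_channels].append(sample)
    return channels

def moving_average(wave, width=4):
    """Linear (FIR) smoothing filter over the last `width` samples."""
    out = []
    for k in range(len(wave)):
        window = wave[max(0, k - width + 1):k + 1]
        out.append(sum(window) / len(window))
    return out
```

In the actor graph, `distribute` corresponds to the signal input actor and `moving_average` to the preconditioning step inside each wave processor.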
Resulting FPGA configuration (Xilinx Virtex-5 FPGA, development board)
[Figure: ECG input via RS232 and UART controller; TRM cores (labels 1 to 4 and 9 to 12 visible) connected through FIFOs (FIFO1, FIFO8, FIFO9, FIFO16 to FIFO20, FIFO33, FIFO34 visible); CF controller with CF; LCD controller with LCD]
Use of Resources
ECG Monitor
#TRM   #LUT          #BRAM      #DSP       TRM load @ 116 MHz
12     13859 (48%)   52 (86%)   12 (25%)   < 10%
Maximum number of TRMs in communication chain
FPGA       #TRM   #LUT          #BRAM       #DSP
Virtex-5   30     27692 (96%)   60 (100%)   30 (62%)
Virtex-6   500
Preconfigured Version
[Figure: TRM 1 to TRM 12 arranged in four columns (Column 0 to Column 3), with four inbound and four outbound arbiters and links labelled H0 to H3; UART controller (RS232, ECG sensor), LCD controller (LCD) and CF controller (CF) attached]
Virtex-5LX50T FPGA, Xilinx ML505 board
Comparative Power Usage
Preconfigured FPGA (TRM, IM/DM, I/O, interconnect) vs. fully configurable
System                   Quiescent power (W)   Dynamic power (W)
Preconfigured            3.43823               0.58988
Dynamically configured   0.49742               0.48060
86% saving!
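The savings can be recomputed from the four measured values. The short calculation below suggests that the 86% figure refers to quiescent power (about 85.5%), while dynamic power drops by roughly 19% and the sum of both by roughly 76%. The variable names are chosen for illustration.

```python
quiescent = {"preconfigured": 3.43823, "dynamic_config": 0.49742}   # watts
dynamic = {"preconfigured": 0.58988, "dynamic_config": 0.48060}     # watts

def saving(before, after):
    """Relative saving as a fraction of the 'before' value."""
    return 1.0 - after / before

quiescent_saving = saving(quiescent["preconfigured"],
                          quiescent["dynamic_config"])
dynamic_saving = saving(dynamic["preconfigured"],
                        dynamic["dynamic_config"])
total_saving = saving(
    quiescent["preconfigured"] + dynamic["preconfigured"],
    quiescent["dynamic_config"] + dynamic["dynamic_config"],
)
```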
Graphics Based Motion Detection
Problem: Detect moving objects in a series of image
frames
Approach: Parallelize detection process by domain
decomposition (into 4 parts)
Design: A reader process continuously reads frames
from external memory and forwards them to (4)
part-detection processes running in parallel and
reporting detected movements
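A software analogue of this design is sketched below. The slides do not say how a part-detection process decides that something moved or how the frame is cut into parts, so the sketch assumes simple frame differencing with a threshold and a split into horizontal bands. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_part(prev_rows, curr_rows, threshold=16):
    """Count pixels whose brightness changed by more than `threshold`."""
    return sum(
        1
        for prev_row, curr_row in zip(prev_rows, curr_rows)
        for p, c in zip(prev_row, curr_row)
        if abs(p - c) > threshold
    )

def detect_motion(prev_frame, curr_frame, parts=4):
    """Split both frames into horizontal bands and check them in parallel."""
    rows = len(curr_frame)
    bounds = [(k * rows // parts, (k + 1) * rows // parts)
              for k in range(parts)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        futures = [
            pool.submit(detect_part, prev_frame[lo:hi], curr_frame[lo:hi])
            for lo, hi in bounds
        ]
        return [f.result() for f in futures]
```

The caller plays the role of the reader process, and each submitted task corresponds to one part-detection process.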
FPGA Configuration
Performance Results
Data base
10 frames of resolution 576 x 768 (432 KP)
Estimated performance
Transfer from external DDR2 memory: ca. 40 MP/sec
Computation: 4 x 31 MP/sec
Total time used per frame: 55 ms
Total throughput: 18 frames/sec
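Two of these numbers follow directly from the others, as the short check below shows. It assumes that KP means 1024 pixels. The 55 ms per frame is taken from the slide and is not derived here.

```python
dim_a, dim_b = 576, 768
pixels_per_frame = dim_a * dim_b             # 442368 pixels
kilo_pixels = pixels_per_frame / 1024        # matches the 432 KP above

frame_time_s = 0.055                         # 55 ms per frame
frames_per_second = 1.0 / frame_time_s       # a little over 18
```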
Programming Language & Compiler
Programming language & automated mapping
The ActiveCells Language
History & Profile
Evolution of Pascal, Modula, Oberon
Actor based
Compositional
Active cell (Actor)
Object with active behavior, communicating via channels
Assembly
Network of interoperating active cells
Reusable software component with ports interface
Example of Functional Actor
F = actor (in1, in2: instr; out: outstr);
var i, j: integer;
begin
loop
recv(in1, i); recv(in2, j);
send(out, someOp(i, j))
end
end
Example of User Interface Actor
UI = actor (out1, out2: outstr; in: instr);
var i, j, k: integer;
begin
loop
RS232.RecvInt(i); RS232.RecvInt(j);
send(out1, i); send(out2, j);
recv(in, k);
RS232.SendInt(k)
end
end
Examples of Assemblies
Assembly without ports (A)
[Figure: UI (attached to RS232; ports out1, out2, in) connected to F (ports in1, in2, out)]
Assembly with ports (B)
[Figure: two F actors feed one G actor; ports in1 to in4 and out of B are delegated to the inner actors]
Assembly A Code
assembly A; (*without ports*)
import RS232;
type
F = actor (in1, in2: instr; out: outstr);
UI = actor (out1, out2: outstr; in: instr);
var ifc: UI; f: F;
begin new(ifc); new(f);
connect(ifc.out1, f.in1); connect(ifc.out2, f.in2);
connect(f.out, ifc.in)
end A.
Assembly B Code
assembly B (in1, in2, in3, in4: instr; out: outstr);
(*with five ports*)
type F, G = actor (in1, in2: instr; out: outstr);
var f1, f2: F; g: G;
begin new(f1); new(f2); new(g);
connect(f1.out, g.in1); connect(f2.out, g.in2);
delegate(in1, f1.in1); delegate(in2, f1.in2);
delegate(in3, f2.in1); delegate(in4, f2.in2);
delegate(out, g.out)
end B.
Built-In Vector Types and Operators
Runge-Kutta (x, x1, k1, k2, … 3d vectors)
while t <= tmax do
k1 := f(t, x);
k2 := f(t + dt/2, x + dt/2 * k1);
k3 := f(t + dt/2, x + dt/2 * k2);
k4 := f(t + dt, x + dt * k3);
x1 := x + dt/3 * (1/2 * k1 + k2 + k3 + 1/2 * k4);
Draw(x, x1);
x := x1; t := t + dt;
end
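For comparison, the same Runge-Kutta step can be written without built-in vector operators. The Python sketch below uses plain lists and a helper `axpy` (a hypothetical name) and keeps the weights of the slide. The test equation dx/dt = x is an assumption chosen because its exact solution is known, and the loop runs a fixed number of steps instead of testing `t <= tmax` to avoid floating-point drift.

```python
import math

def axpy(a, x, y):
    """Return a*x + y for small vectors stored as lists."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def rk4_step(f, t, x, dt):
    """One classical Runge-Kutta step from (t, x) to t + dt."""
    k1 = f(t, x)
    k2 = f(t + dt / 2, axpy(dt / 2, k1, x))
    k3 = f(t + dt / 2, axpy(dt / 2, k2, x))
    k4 = f(t + dt, axpy(dt, k3, x))
    # same weights as above: dt/3 * (k1/2 + k2 + k3 + k4/2)
    incr = [0.5 * a + b + c + 0.5 * d for a, b, c, d in zip(k1, k2, k3, k4)]
    return axpy(dt / 3, incr, x)

def integrate(f, x, t, tmax, dt):
    """Advance x from t to tmax in steps of dt."""
    steps = round((tmax - t) / dt)
    for _ in range(steps):
        x = rk4_step(f, t, x, dt)
        t += dt
    return x

# Test problem (not from the talk): dx/dt = x, so x(1) = e * x(0).
x0 = [1.0, 2.0, 3.0]
x_end = integrate(lambda t, x: list(x), x0, 0.0, 1.0, 0.01)
expected = [math.e * v for v in x0]
```

With vector types built into the language, each `axpy` call collapses into an ordinary arithmetic expression, which is what makes the ActiveCells version so compact.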
Built-In Matrix Types and Operators
Graphics pipeline (Matrix multiplication)
M := Graphics.Proj(left, right, bot, top, near, far)
* Graphics.Trans(0.0, 0.0, -d)
* Graphics.RotX(elev)
* Graphics.RotY(-azim)
* Graphics.Trans(0.0, 0.0, -zm)
Hybrid Compilation
Code body   Role                            Compilation method
Actor       Business logic                  Software compilation (TRM/DSP)
Assembly    Creating actor graph (wiring)   Hardware compilation (Verilog)
Actor Code
F = actor (in1, in2: instr; out: outstr);
var i, j: integer;
begin
loop
recv(in1, i); recv(in2, j);
send(out, someOp(i, j))
end
end
Assembly Code
assembly B (in1, in2, in3, in4: instr;
out: outstr);
type F, G = actor (in1, in2: instr; out: outstr);
var f1, f2: F; g: G;
begin new(f1); new(f2); new(g);
connect(f1.out, g.in1); connect(f2.out, g.in2);
delegate(in1, f1.in1); delegate(in2, f1.in2);
delegate(in3, f2.in1); delegate(in4, f2.in2);
delegate(out, g.out)
end B.
Automated Mapping to FPGA
[Figure: inputs are the source program, the runtime library, the hardware library and scripts (make.tcl, ram.bmm). Flow: source program -> hybrid compiler -> TRM code (memory images, .mem) and Verilog code -> Xilinx synthesizer -> bits]
Program Model Refinement
Each thread may spawn any number of mutually independent sub-threads
Advantages
Allows (lock-free) fine-grained parallel computing
Requirements
Needs core clustering
Needs runtime scheduling support
Needs barrier mechanism
[Figure labels: A, A1, A2, spawn, barrier]
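The spawn and barrier pattern can be illustrated with ordinary threads. This is a sketch on a conventional operating system, not on TRM cores. The function `parallel_sum` is hypothetical, and the work (summing slices of a list) is chosen only because the sub-threads are mutually independent.

```python
import threading

def parallel_sum(data, n_sub=4):
    """Parent thread spawns n_sub sub-threads and waits at a barrier."""
    partial = [0] * n_sub
    barrier = threading.Barrier(n_sub + 1)    # sub-threads plus the parent

    def sub_thread(k):
        partial[k] = sum(data[k::n_sub])      # each slot has exactly one writer
        barrier.wait()

    for k in range(n_sub):
        threading.Thread(target=sub_thread, args=(k,)).start()
    barrier.wait()                            # parent continues once all are done
    return sum(partial)
```

No locks are needed because every sub-thread writes to its own slot, and the barrier is the only synchronization point.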
Next Step
Use the ActiveCells language for developing embedded software on top of some standard IDE
Including design, programming, debugging, analyzing
Analyzer may need cycle accurate simulator
Use fully automated tool to generate an FPGA image
[Figure label: burn down]
Power Management Codesign
Integrated HW/SW power management system
Collaboration with Prof. Shiao-Li Tsao, National
Chiao Tung University, Taiwan
Performance/Energy Space
P/ E Profiling
Clock Gating Strategy
with clock always on
with clock gating
Power Management as Add-On
Clock gating
PM Add-On generated automatically on demand
actor { PM } (...);
[Figure: TRM with PM Add-On circuitry; signals data, clk, in, out]
• Instruction: clockOff()
• Control registers: TRM mode, clock rate, voltage
• Signals: Data on port
• I/O ports: Interop with PM controller
• Internal memory: backup TRM state/registers
Clock Gating Off Procedure
[Figure: TRM with PM Add-On circuitry (data, clk, in, out), PM Controller, Clock Manager]
PM Add-On circuitry signals PM controller
PM controller: stop clock
Clock Gating On Procedure
[Figure: TRM with PM Add-On circuitry (data, clk, in, out), PM Controller, Clock Manager]
Data arrives
PM controller feeds in clock
TRM processor resumes
SW Add-on Enhancements
Conditional compilation of (blocking) recv statement
recv(in, a) without { PM } option
  repeat
  until nonblockingRecv(in, a);
recv(in, a) with { PM } option
  resetTimer(shortTime);
  repeat dataAvailable := nonblockingRecv(in, a)
  until timerExpired() or dataAvailable;
  stopTimer();
  if ~dataAvailable then clockOff() end
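The effect of the { PM } variant can be tried out in a small simulation. The sketch replaces the hardware timer with a poll budget and the clock with a boolean flag. The names `Port`, `deliver` and `recv_pm` are hypothetical and only loosely follow the generated code.

```python
import collections

class Port:
    """Toy input port; `clock_on` stands in for the gated TRM clock."""

    def __init__(self):
        self.fifo = collections.deque()
        self.clock_on = True

    def nonblocking_recv(self):
        """Return (True, value) if data is available, else (False, None)."""
        return (True, self.fifo.popleft()) if self.fifo else (False, None)

    def deliver(self, value):
        """Data arriving on the port switches the clock back on."""
        self.fifo.append(value)
        self.clock_on = True

def recv_pm(port, max_polls=3):
    """Poll for a bounded time, then gate the clock if nothing arrived."""
    for _ in range(max_polls):        # stands in for the short timer
        ok, value = port.nonblocking_recv()
        if ok:
            return value
    port.clock_on = False             # corresponds to clockOff()
    return None
```

A short polling phase avoids paying the wake-up cost when data is about to arrive anyway, and only a longer silence switches the clock off.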
Next Step for Real Time Software
begin { T } ... (* statements *) end
Adjust idle/busy periods or clock rate between begin ... end to just meet indicated time limit T
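One way to read "just meet the time limit" is to pick the lowest clock rate that still finishes the block in time. The sketch below assumes that the cycle count of the block is known and that a few discrete clock rates are available. Both the function `pick_clock_rate` and the rate list are hypothetical, and only the 116 MHz system clock comes from the talk.

```python
def pick_clock_rate(cycles, time_limit_s, rates_hz):
    """Lowest available clock rate that still finishes within the limit."""
    needed = cycles / time_limit_s
    feasible = [r for r in sorted(rates_hz) if r >= needed]
    return feasible[0] if feasible else None

# Hypothetical clock steps up to the 116 MHz system clock.
rates = [29e6, 58e6, 116e6]
```

For a block of one million cycles and a limit of 20 ms the required rate is 50 MHz, so the 58 MHz step would be chosen. If no rate is fast enough the function returns `None`.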
Hardware Library
Bridge the semantic gap between software
functions and hardware circuitry
Motivation
Allow automatic generation of tailored hardware for a given stream application
The semantic gap between application model and hardware circuitry is too big
An abstraction of hardware circuitry is required to bridge the gap
A clear classification of hardware components is required to achieve efficient mapping with regard to resources, performance and energy
Hardware Components Classification
Computation Components
• General purpose minimal
machine: TRM
• Vector machine: VTRM
Communication Components
• FIFOs
• 32 * 128
• 512 * 128
• 32, 64, 128, 1k * 32
Storage Components
• DMA + TRM: DTRM
• direct transfer vector
from DDR to VTRM
I/O Components
• TRM + I/O access: IOTRM
• packing/unpacking I/O
data to vectors or words
Abstraction
Hardware interfaces
Computation components
#(IMB, DMB) TRM (input clk, rst, irq0, irq1, input[31:0] inbus,
output[5:0] ioadr, output iowr, iord,
output[31:0] outbus)
#(VL, IMB) VTRM (input clk, rst, input[VL*32-1:0] inbus,
output[5:0] ioadr, output iowr, iord,
output[VL*32-1:0] outbus)
Communication components
#(Width, Depth) ParChannel (input clk, rst, input[Width-1:0] inData,
input wreq, rdreq,
output[Width-1:0] outData,
output[31:0] status)
Storage component
#(DataWidth) DTRM (input clk, rst,
input[DataWidth-1:0] inbus,
output[5:0] ioadr, output iowr, iord,
output[DataWidth-1:0] outbus)
IO component
#(VL) IOTRM (input clk, rst,
input [VL*32-1:0] inbus,
output [5:0] ioadr, output iowr, iord,
output[VL*32-1:0] outbus)
TRM (Tiny Register Machine)
2-address register machine (8 registers)
Configurable instruction/ data memory
Optional I/O controller added
[Block diagram labels: IMemory (4K x 18 bits), 18-bit path, Decoder, DMemory (1K x 32 bits), 32-bit path, Registers, ALU; clock 116 MHz]
Vector TRM
8 vector registers (each 8 32-bit floats)
Vector add/ multiply takes 4 cycles
Horizontal addition takes 10 cycles
[Block diagram labels: TRM, IMemory (4K x 18 bits), DMemory (8K x 32 bits), Vector unit, 256-bit paths]
DMA TRM
256 bits wide data bus
Loading 256 bits from DMA takes 2 cycles
Storing 256 bits to DMA takes 1 cycle
[Block diagram labels: I/O data bus (256 bits), IMemory (4K x 18 bits), TRM, DMA (256 bits), DMemory (1K x 32 bits)]
Area, Performance Features
(on Virtex-5LX50T)
System clock speed: 116MHz
TRM : 2% LUTs, 1 DSP, 5 cycles for multiplication
VTRM
Integer vector unit, VL = 4: 8% LUTs, 8 DSPs,
5 cycles for vector multiplication, 3 cycles for horizontal vector addition
Floating point vector unit, VL = 4: 18% LUTs, 9 DSPs
DMA: 10% LUTs, 1 DSP, 2 cycles for loading a block from
DDR2 controller buffer, 1 cycle for writing a block into DDR2
controller buffer
IOTRM: 5% LUTs, 1 DSP, 2 cycles for loading a vector, 1 cycle
for writing a vector
References
http://www.nativesystems.inf.ethz.ch/
Reference papers
Ling Liu, Oleksii Morozov, A Process-Oriented Streaming System Design Paradigm for FPGAs, Reconfig’2010, Cancun, Mexico, December 13-15, 2010.
Ling Liu, Oleksii Morozov, Yuxing Han, Jürg Gutknecht, Patrick Hunziker, Automatic SoC Design Flow on Manycore Processors: a Software Hardware Co-Design Approach for FPGAs, FPGA’2011, Monterey, California, February 27 - March 1, 2011.
Reserve Slides
Program Model Refinement 2
Separate agent thread for each communication
Each actor running one main thread (behavior) and several communication threads (agents) under mutual exclusion
Advantages
Stateful dialogs
No deadlocks
Requirements
Fast context switches
[Figure labels: X, Y, X; behavior, communication c]
Wiring Integrated into Actors
module M;
var x1, x2: X;
y: Y;
type
X = object … end X;
Y = object … end Y;
begin
new(y);
new(x1, y);
new (x2, y)
end M.
X = object
var c: Y.C;
activity A;
var i, j, k: integer;
begin (*behave*)
…; c(i, j); …; c(k); …
end A;
procedure X (y: Y);
begin (*build object*)
…; new (c); …
end X;
begin new A (*launch behavior*)
end X;
Y = object
activity A;
begin (*behave*) …
end A;
activity C;
var u, v, w: integer;
begin (*communicate*)
…; accept(u, v);
…; accept(w); …
end C;
procedure Y;
begin (*construct*) …
end Y;
begin new A
end Y;