STRUCTURED CODESIGN
FOR MANYCORE SYSTEMS
Jürg Gutknecht & Lisa (Ling) Liu, ETH Zürich
Sofsem Novy Smokovec, January 2011
About Me






1968 System programming at Swissair
1977 PhD in Mathematics
1981 Joined Niklaus Wirth's Lilith/ Modula team
1985 Sabbatical stay at Xerox PARC
1986 Project Oberon together with Wirth
2000 Academic languages researcher at MSR
Outline of Talk






Context & Vision
A Structured Approach
Use Cases
Programming Language & Compiler
Power Management Codesign
Hardware Library
Context & Vision
Some context of the project and a vision
Microsoft Innovation Cluster




Launched in 2008 by Microsoft (Research)
Volume 5 years/ $5 mio
Theme embedded systems software
Participants
 ETH Zürich (3 projects)
 EPFL Lausanne (4 projects)
 „Supercomputer in the pocket“ is one among them
Goals
Research in embedded systems
 Technology transfer
 Education

Supercomputer in the Pocket



Manycore architecture for embedded systems on
the basis of programmable hardware (FPGA)
High-performance computing in the small
Generic technology for wide range of apps
 Sensor driven medical IT (will be focussed in this talk)
 Data streaming in financial apps
 Running robot with limb control
 Real time audio processing

Hardware/ software design from the ground up
People Involved

Microsoft Research
 Chuck Thacker (consultant)
ETH Zürich
 Niklaus Wirth (processor design)
 Jürg Gutknecht (project leader)
 Lisa (Ling) Liu (hardware design)
 Felix Friedrich (compiler)
University Hospital Basel
 Alexej Morozow (medical IT app)
The Vision





Custom hardware design for embedded systems
Programmers need no hardware knowledge
System design process at high level of abstraction
Fully automated mapping process to FPGA
FPGA resources are used efficiently
Semantic Gap
Program Constructs
 Object
 Thread
 Data structure
 Statement
 Communication
 I/O
 ...
Map to
FPGA Resources
 Lookup tables (LUT)
 Block RAMs (BRAM)
 DSP slices
 …
A Structured Approach
Big picture of our structured codesign approach
Options for How to Achieve It




Hardware compilation: Custom mapping of specific
algorithms (or hot spots) to hardware circuits.
Uniprocessor: Single universal processor plus on-chip
cache memory. Transparently connected to external
memory.
SMP: Several universal processors, each with on-chip
cache memory, and each transparently connected to
external memory. Cache coherence mechanism needed.
Preconfigured: Several universal processors, each with
private on-chip memory. Interconnected via on-chip
network. One processor connected to external memory.
A Better Approach


Hardware/ software codesign based on a suitable
high-level computing model and programming
language
Fully automated mapping/ synthesizing to FPGA
hardware based on suitable library of highly
configurable hardware components
Our Computing Model

Active Cell (Actor)
 Object with private state space
 Behavior control thread
 Communicating with other actors via channels
Actor Graph
 Collection of interoperating actors running in parallel
 Some actors connected to I/O via serial port
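The computing model can be mimicked in ordinary software. The following Python sketch is only an illustration: it uses threads and queues rather than TRM cores and FIFOs, and the names `Channel`, `start_actor`, `doubler` and the `None` end-of-stream marker are assumptions made for this example, not part of the project.

```python
import queue
import threading

class Channel:
    """Point-to-point channel between two actors (hypothetical helper)."""
    def __init__(self, size=32):
        self._q = queue.Queue(maxsize=size)

    def send(self, value):
        self._q.put(value)       # blocks while the channel is full

    def recv(self):
        return self._q.get()     # blocks while the channel is empty

def start_actor(behavior, *ports):
    """Run one behavior control thread for an actor."""
    t = threading.Thread(target=behavior, args=ports, daemon=True)
    t.start()
    return t

def doubler(inp, out):
    # Private state lives in local variables; nothing is shared.
    while True:
        x = inp.recv()
        if x is None:            # assumed end-of-stream marker
            out.send(None)
            return
        out.send(2 * x)

def run_graph(values):
    """Tiny actor graph: caller -> doubler -> caller."""
    a, b = Channel(), Channel()
    start_actor(doubler, a, b)
    results = []
    for v in values:
        a.send(v)
        results.append(b.recv())
    a.send(None)
    b.recv()
    return results
```

Each actor owns its state and interacts only through `send` and `recv`, which is the property that lets an actor be placed on its own core.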
Our Hardware Library

TRM processor (Tiny Register Machine)
 Extremely simple
 Two level pipelined instruction execution
 Several variants
  VTRM (vectors via DSP), DTRM (DMA)
Communication FIFO
 Ring buffer
 Sizes 32, 64, 128, 1024
I/O controllers
 DDR2, CF, LCD, UART
Mapping
Actor Graph             Map   FPGA
Actor                   ->    TRM processor („core“), instruction memory, data memory
Communication channel   ->    FIFO buffer
I/O                     ->    I/O controllers connected to cores
TRM/ FIFO Cooperation
[Figure labels: channel, FIFO, recv, TRM, M, send, FIFO, channel]
•fully orchestrated by TRM
•no interrupts!
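To make "fully orchestrated by TRM, no interrupts" concrete, here is a minimal single-threaded Python sketch of a ring-buffer FIFO that is polled by its user. The class `RingFifo` and its method names are hypothetical and model only the polling discipline, not the actual hardware FIFO.

```python
class RingFifo:
    """Fixed-size ring buffer of integer words (slide sizes: 32, 64, 128, 1024)."""
    def __init__(self, size=32):
        self.buf = [0] * size
        self.size = size
        self.head = 0    # index of the next word to read
        self.count = 0   # number of words currently stored

    def try_send(self, word):
        if self.count == self.size:
            return False                       # full: sender polls again
        self.buf[(self.head + self.count) % self.size] = word
        self.count += 1
        return True

    def try_recv(self):
        if self.count == 0:
            return None                        # empty: receiver polls again
        word = self.buf[self.head]
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return word

def recv(fifo):
    """Blocking receive built from polling, i.e. without interrupts."""
    while True:
        word = fifo.try_recv()
        if word is not None:
            return word
```

The processor decides when to look at the FIFO, so no interrupt logic is needed; the price is busy waiting, which the power management part of the talk addresses.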
Use Cases
Two data driven applications of our system
Realtime Multichannel ECG Monitor


Analyze the activity of the heart, the morphology of
the corresponding waves, and the heart rate
variability (HRV), with the aim of detecting and
classifying potential anomalies
The signal to be analyzed decomposes into 8
physical channels, each of them sampled at 500 Hz
Decomposition into Actor Graph
[Figure: ECG bitstream -> Signal input -> Wave proc_1, Wave proc_2, ..., Wave proc_8 -> QRS detect -> HRV analysis -> Disease classifier -> out stream]
Actions





Receive ECG signal from UART, compose individual
samples, and distribute them to channel processors.
(Per channel): Precondition wave by suppressing noise
via linear filtering; Detect the heart beats and
contractions.
Detect QRS patterns and make a final decision about
heart rate on the basis of standard multichannel logic.
Analyze the current heart rhythm and the heart rate
variability (HRV).
Use decision tree logic to detect and classify arrhythmia
events such as premature ventricular contractions (PVC),
ventricular tachycardia etc. Feed results back to
configure wave processing.
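The first two actions can be sketched in a few lines of Python. This is a rough illustration under assumptions the slides do not confirm: samples are taken to arrive interleaved round-robin over the 8 channels, and a plain moving average stands in for the unspecified linear filter. The names `distribute` and `moving_average` are hypothetical.

```python
NUM_CHANNELS = 8  # from the slide: 8 physical channels

def distribute(samples):
    """Split an interleaved sample stream into per-channel streams.

    Assumes round-robin order: ch0, ch1, ..., ch7, ch0, ...
    """
    channels = [[] for _ in range(NUM_CHANNELS)]
    for i, s in enumerate(samples):
        channels[i % NUM_CHANNELS].append(s)
    return channels

def moving_average(wave, taps=4):
    """Simple linear (FIR) low-pass filter; the tap count is arbitrary."""
    out = []
    for n in range(len(wave)):
        window = wave[max(0, n - taps + 1): n + 1]
        out.append(sum(window) / len(window))
    return out
```

In the actor graph, `distribute` corresponds to the signal input actor and `moving_average` to the preconditioning step inside each wave processor.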
Resulting FPGA configuration
[Figure: Xilinx Virtex-5 FPGA development board. ECG input arrives via RS232 at the UART Ctrl; TRM 1 to TRM 4 and TRM 9 to TRM 12 are shown, linked by FIFOs (FIFO1, FIFO8, FIFO9, FIFO16 to FIFO20, FIFO33, FIFO34); CF Ctrl drives CF and LCD Ctrl drives LCD.]
Use of Resources

ECG Monitor
#TRM   #LUT          #BRAM      #DSP       TRM load @ 116 MHz
12     13859 (48%)   52 (86%)   12 (25%)   < 10%

Maximum number of TRMs in communication chain
FPGA       #TRM   #LUT          #BRAM       #DSP
Virtex-5   30     27692 (96%)   60 (100%)   30 (62%)
Virtex-6   500
Preconfigured Version
[Figure: Preconfigured configuration on a Virtex-5LX50T FPGA (Xilinx ML505 board). TRM 1 to TRM 12 are arranged in four columns (Column 0 to Column 3) with inbound and outbound arbiters; further labels H0 to H3. The UART controller (RS232, ECG sensor), LCD controller (LCD) and CF controller (CF) are attached.]
Comparative Power Usage


Preconfigured FPGA (TRM, IM/ DM, I/O, interconnect)
Fully configurable

System                   Quiescent power (W)   Dynamic power (W)
Preconfigured            3.43823               0.58988
Dynamically configured   0.49742               0.48060

86% saving in quiescent power!
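The saving can be recomputed from the four measured values. The short Python check below uses only the numbers from the table; the helper name `saving` is made up for this example. It suggests that the 86% figure refers to quiescent power, which comes out at roughly 85.5%, while the dynamic and total savings are smaller.

```python
def saving(before, after):
    """Relative saving in percent."""
    return 100.0 * (before - after) / before

quiescent = saving(3.43823, 0.49742)
dynamic = saving(0.58988, 0.48060)
total = saving(3.43823 + 0.58988, 0.49742 + 0.48060)

print(f"quiescent {quiescent:.1f}%, dynamic {dynamic:.1f}%, total {total:.1f}%")
```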
Graphics Based Motion Detection



Problem: Detect moving objects in a series of image
frames
Approach: Parallelize detection process by domain
decomposition (into 4 parts)
Design: A reader process continuously reads frames
from external memory and forwards them to (4)
part-detection processes running in parallel and
reporting detected movements
FPGA Configuration
Performance Results

Data base
 10 frames of resolution 576 x 768 (432 KP)
Estimated performance
 Transfer from external DDR2 memory ca. 40 MP/sec
 Computation: 4 x 31 MP/sec
 Total time used per frame 55 ms
 Total throughput 18 frames/ sec
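Two of these figures can be cross-checked with simple arithmetic. The Python lines below assume that K means 1024 pixels; they deliberately do not try to derive the 55 ms from the transfer and computation rates, because the slide does not state how the two phases overlap.

```python
width, height = 576, 768
pixels = width * height               # pixels per frame
kilo_pixels = pixels / 1024           # matches "432 KP" if K = 1024

frame_time_s = 0.055                  # 55 ms per frame, from the slide
frames_per_second = 1 / frame_time_s  # consistent with "18 frames/ sec"
```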
Programming Language & Compiler
Programming language & automated mapping
The ActiveCells Language

History & Profile
 Evolution of Pascal, Modula, Oberon
 Actor based
 Compositional
Active cell (Actor)
 Object with active behavior, communicating via channels
Assembly
 Network of interoperating active cells
 Reusable software component with ports interface
Example of Functional Actor

F = actor (in1, in2: instr; out: outstr);
  var i, j: integer;
begin
  loop
    recv(in1, i); recv(in2, j);
    send(out, someOp(i, j))
  end
end
Example of User Interface Actor

UI = actor (out1, out2: outstr; in: instr);
  var i, j, k: integer;
begin
  loop
    RS232.RecvInt(i); RS232.RecvInt(j);
    send(out1, i); send(out2, j);
    recv(in, k);
    RS232.SendInt(k)
  end
end
Examples of Assemblies

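Assembly without ports
[Figure: assembly A connects UI (attached to RS232; ports out1, out2, in) with F (ports in1, in2, out)]
Assembly with ports
[Figure: assembly B with ports in1, in2, in3, in4, out delegates to two F actors (in1, in2, out) whose outputs feed G (in1, in2, out)]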
Assembly A Code

assembly A; (*without ports*)
import RS232;
type
F = actor (in1, in2: instr; out: outstr);
UI = actor (out1, out2: outstr; in: instr);
var ifc: UI; f: F;
begin new(ifc); new(f);
connect(ifc.out1, f.in1); connect(ifc.out2, f.in2);
connect(f.out, ifc.in)
end A.
Assembly B Code

assembly B (in1, in2, in3, in4: instr; out: outstr);
(*with five ports*)
  type F, G = actor (in1, in2: instr; out: outstr);
  var f1, f2: F; g: G;
begin new(f1); new(f2); new(g);
  connect(f1.out, g.in1); connect(f2.out, g.in2);
  delegate(in1, f1.in1); delegate(in2, f1.in2);
  delegate(in3, f2.in1); delegate(in4, f2.in2);
  delegate(out, g.out)
end B.
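For readers unfamiliar with the connect/delegate idea, the wiring of assembly B can be imitated in Python with queues as channels. This is a hedged sketch, not the ActiveCells semantics: `actor`, `assembly_b`, `demo` and the two operations passed in are assumptions for illustration, and delegation is modelled simply by handing the assembly's own port to the inner actor.

```python
import queue
import threading

def actor(op, in1, in2, out):
    """Start an actor thread that combines one item from each input."""
    def behavior():
        while True:
            out.put(op(in1.get(), in2.get()))
    threading.Thread(target=behavior, daemon=True).start()

def assembly_b(in1, in2, in3, in4, out, f_op, g_op):
    """Mirror of assembly B: two F actors feeding one G actor."""
    c1, c2 = queue.Queue(), queue.Queue()  # internal channels f1.out->g.in1, f2.out->g.in2
    actor(f_op, in1, in2, c1)              # in1, in2 delegated to f1
    actor(f_op, in3, in4, c2)              # in3, in4 delegated to f2
    actor(g_op, c1, c2, out)               # out delegated to g.out

def demo():
    ports = [queue.Queue() for _ in range(5)]
    assembly_b(*ports, f_op=lambda a, b: a + b, g_op=lambda a, b: a * b)
    for port, value in zip(ports[:4], (1, 2, 3, 4)):
        port.put(value)
    return ports[4].get(timeout=5)         # (1 + 2) * (3 + 4)
```

From the outside, the assembly behaves like a single component with five ports, which is what makes it reusable inside larger assemblies.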
Built-In Vector Types and Operators

Runge-Kutta (x, x1, k1, k2, … 3d vectors)
 while t <= tmax do
   k1 := f(t, x);
   k2 := f(t + dt/2, x + dt/2 * k1);
   k3 := f(t + dt/2, x + dt/2 * k2);
   k4 := f(t + dt, x + dt * k3);
   x1 := x + dt/3 * (1/2 * k1 + k2 + k3 + 1/2 * k4);
   Draw(x, x1);
   x := x1; t := t + dt;
 end
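Without built-in vector operators, the same Runge-Kutta loop needs explicit helper functions. The Python sketch below spells this out for comparison; `add`, `scale`, `rk4_step` and `integrate` are names chosen for this example, and the `Draw` call is omitted.

```python
def add(u, v):
    return [a + b for a, b in zip(u, v)]

def scale(c, u):
    return [c * a for a in u]

def rk4_step(f, t, x, dt):
    """One classical Runge-Kutta step, written as on the slide."""
    k1 = f(t, x)
    k2 = f(t + dt / 2, add(x, scale(dt / 2, k1)))
    k3 = f(t + dt / 2, add(x, scale(dt / 2, k2)))
    k4 = f(t + dt, add(x, scale(dt, k3)))
    # dt/3 * (k1/2 + k2 + k3 + k4/2) equals dt/6 * (k1 + 2*k2 + 2*k3 + k4)
    s = add(add(scale(0.5, k1), k2), add(k3, scale(0.5, k4)))
    return add(x, scale(dt / 3, s))

def integrate(f, x, t, tmax, dt):
    """Repeat steps while t <= tmax, mirroring the loop on the slide."""
    while t <= tmax:
        x = rk4_step(f, t, x, dt)
        t += dt
    return x
```

The contrast with the slide shows what the built-in vector types buy: the ActiveCells version reads like the mathematical formula, and the compiler can map the vector operations to the VTRM.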
Built-In Matrix Types and Operators

Graphics pipeline (Matrix multiplication)
 M := Graphics.Proj(left, right, bot, top, near, far)
   * Graphics.Trans(0.0, 0.0, -d)
   * Graphics.RotX(elev)
   * Graphics.RotY(-azim)
   * Graphics.Trans(0.0, 0.0, -zm)
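As an illustration of what the matrix expression computes, here is a small Python sketch of the translation and rotation part of the chain using 4x4 homogeneous matrices. The conventions (column vectors, the usual right-handed rotation matrices) are assumptions, the `Graphics` module itself is not reproduced, and the projection factor is left out.

```python
import math

def matmul(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def trans(x, y, z):
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]

def rot_x(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]]

def rot_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]]

def view_matrix(d, elev, azim, zm):
    """Trans * RotX * RotY * Trans, as on the slide but without Proj."""
    m = trans(0.0, 0.0, -d)
    for factor in (rot_x(elev), rot_y(-azim), trans(0.0, 0.0, -zm)):
        m = matmul(m, factor)
    return m
```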
Hybrid Compilation
Code body   Role                            Compilation method
Actor       Business logic                  Software compilation (TRM/ DSP)
Assembly    Creating actor graph (wiring)   Hardware compilation (Verilog)
Actor Code

F = actor (in1, in2: instr; out: outstr);
  var i, j: integer;
begin
  loop
    recv(in1, i); recv(in2, j);
    send(out, someOp(i, j))
  end
end
Assembly Code

assembly B (in1, in2, in3, in4: instr;
            out: outstr);
  type F, G = actor (in1, in2: instr; out: outstr);
  var f1, f2: F; g: G;
begin new(f1); new(f2); new(g);
  connect(f1.out, g.in1); connect(f2.out, g.in2);
  delegate(in1, f1.in1); delegate(in2, f1.in2);
  delegate(in3, f2.in1); delegate(in4, f2.in2);
  delegate(out, g.out)
end B.
Automated Mapping to FPGA
[Figure: source program -> hybrid compiler -> TRM code memory images (.mem) and Verilog code -> Xilinx synthesizer -> bits; further inputs: runtime library, hardware library, scripts make.tcl, ram.bmm]
Program Model Refinement


Each thread may spawn any number of mutually independent sub-threads
Advantages
 Allows (lock-free) fine-grained parallel computing
Requirements
 Needs core clustering
 Needs runtime scheduling support
 Needs barrier mechanism
[Figure: thread A spawns sub-threads A1 and A2, which join at a barrier]
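The spawn/barrier pattern can be tried out with ordinary Python threads. The sketch below is an analogy only: it relies on `threading.Barrier` from the standard library, whereas the slide refers to a mechanism that still has to be provided by the runtime and the core cluster. The function name and the example workload are made up.

```python
import threading

def parallel_sum_of_squares(data, workers=4):
    """Spawn independent sub-threads and join them at a barrier."""
    partial = [0] * workers
    barrier = threading.Barrier(workers + 1)   # sub-threads plus parent

    def sub_thread(k):
        # Each sub-thread writes only its own slot, so no lock is needed.
        partial[k] = sum(x * x for x in data[k::workers])
        barrier.wait()

    for k in range(workers):
        threading.Thread(target=sub_thread, args=(k,)).start()
    barrier.wait()                              # parent continues once all arrived
    return sum(partial)
```

Because the sub-threads are mutually independent, the only synchronization point is the barrier itself.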
Next Step

Use the ActiveCells language for developing embedded software on top of some standard IDE
 Including design, programming, debugging, analyzing
 Analyzer may need cycle accurate simulator
Use fully automated tool to generate an FPGA image
[Figure label: burn down]
Power Management Codesign
Integrated HW/SW power management system
Collaboration with Prof. Shiao-Li Tsao, National
Chiao Tung University, Taiwan
Performance/ Energy Space
P/ E Profiling
Clock Gating Strategy
with clock always on
with clock gating
Power Management as Add-On


Clock gating
PM Add-On generated automatically on demand
 actor { PM } (...);
[Figure: TRM with PM Add-On Circuitry; signals data, clk, in, out]
•Instruction
  •clockOff()
•Control registers
  •TRM mode, clock rate, voltage
•Signals
  •Data on port
•I/O ports
  •Interop with PM controller
•Internal memory
  •backup TRM state/ registers
Clock Gating Off Procedure
 Signal PM controller
 Stop clock
[Figure: TRM with PM Add-On Circuitry (data, clk, in, out), PM Controller, Clock Manager]
Clock Gating On Procedure
 Data arrives
 PM controller feeds in clock
 TRM processor resumes
[Figure: TRM with PM AddOn Circuitry (data, clk, in, out), PM Controller, Clock Manager]
SW Add-on Enhancements

Conditional compilation of (blocking) recv statement
 recv(in, a) without { PM } option
  repeat until nonblockingRecv(in, a);
 recv(in, a) with { PM } option
  resetTimer(shortTime);
  repeat dataAvailable := nonblockingRecv(in, a)
  until timerExpired() or dataAvailable;
  stopTimer();
  if ~dataAvailable then clockOff() end
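The { PM } variant of recv can be approximated in Python as "poll briefly, then go to sleep until data arrives". In this sketch a blocking `queue.Queue.get` stands in for the gated clock being switched back on by the PM controller, and `clock_off` is a hypothetical hook in place of the `clockOff()` instruction; neither is the real mechanism.

```python
import queue
import time

def recv_with_pm(channel, short_time=0.01, clock_off=None):
    """Poll for a short time, then 'switch the clock off' and wait for data."""
    deadline = time.monotonic() + short_time   # resetTimer(shortTime)
    while time.monotonic() < deadline:         # until timerExpired() ...
        try:
            return channel.get_nowait()        # ... or dataAvailable
        except queue.Empty:
            pass
    if clock_off is not None:
        clock_off()                            # clockOff()
    return channel.get()  # blocks; models the wake-up when data arrives
```

The short polling window avoids paying the clock off/on overhead when data is only slightly late.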
Next Step for Real Time Software

begin { T } ... (* statements *) end
 Adjust idle/ busy periods or clock rate between begin ... end to just meet indicated time limit T
Hardware Library
Bridge the semantic gap between software
functions and hardware circuitry
Motivation


Allow automatic generation of tailored hardware for a given stream application
The semantic gap between application model and hardware circuitry is too big
 An abstraction of hardware circuitry is required to bridge the gap
 A clear classification of hardware components is required to achieve efficient mapping with regard to resource, performance and energy
Hardware Components Classification
Computation Components
• General purpose minimal
machine: TRM
• Vector machine: VTRM
Communication Components
• FIFOs
• 32 * 128
• 512 * 128
• 32, 64, 128, 1k * 32
Storage Components
• DMA + TRM: DTRM
• direct transfer vector
from DDR to VTRM
I/O Components
• TRM + I/O access: IOTRM
• packing/unpacking I/O
data to vectors or words
Abstraction

Hardware interfaces

Computation components
#(IMB, DMB) TRM (input clk, rst, irq0, irq1, input[31:0] inbus,
output[5:0] ioadr, output iowr, iord,
output[31:0] outbus)
#(VL, IMB) VTRM (input clk, rst, input[VL*32-1:0] inbus,
output[5:0] ioadr, output iowr, iord,
output[VL*32-1:0] outbus)

Communication components
#(Width, Depth) ParChannel (input clk, rst, input[Width-1:0] inData,
input wreq, rdreq,
output[Width-1:0] outData,
output[31:0] status)

Storage component
#(DataWidth) DTRM (input clk, rst,
input[DataWidth-1:0] inbus,
output[5:0] ioadr, output iowr, iord,
output[DataWidth-1:0] outbus)

IO component
#(VL) IOTRM (input clk, rst,
input [VL*32-1:0] inbus,
output [5:0] ioadr, output iowr, iord,
output[VL*32-1:0] outbus)
TRM (Tiny Register Machine)



2-address register machine (8 registers)
Configurable instruction/ data memory
Optional I/O controller added
[Figure: TRM block diagram, clocked at 116 MHz: IMemory (4K x 18 bits), Decoder, Registers, ALU, DMemory (1K x 32 bits); path widths labelled 18 and 32]
Vector TRM



8 vector registers (each 8 32-bit floats)
Vector add/ multiply takes 4 cycles
Horizontal addition takes 10 cycles
[Figure: VTRM block diagram: TRM with IMemory (4K x 18 bits) and DMemory (8K x 32 bits), Vector unit; path widths labelled 256]
DMA TRM



256 bits wide data bus
Loading 256 bits from DMA takes 2 cycles
Storing 256 bits to DMA takes 1 cycle
[Figure: DTRM block diagram: TRM with IMemory (4K x 18 bits) and DMemory (1K x 32 bits), DMA unit, I/O data bus; path widths labelled 256]
Area, Performance Features
(on Virtex-5LX50T)



System clock speed: 116 MHz
TRM: 2% LUTs, 1 DSP, 5 cycles for multiplication
VTRM
 Integer vector unit, VL = 4: 8% LUTs, 8 DSPs, 5 cycles for vector multiplication, 3 cycles for horizontal vector addition
 Floating point vector unit, VL = 4: 18% LUTs, 9 DSPs
DMA: 10% LUTs, 1 DSP, 2 cycles for loading a block from DDR2 controller buffer, 1 cycle for writing a block into DDR2 controller buffer
IOTRM: 5% LUTs, 1 DSP, 2 cycles for loading a vector, 1 cycle for writing a vector
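The cycle counts translate into peak transfer rates between the core and the DDR2 controller buffer. The Python arithmetic below assumes that a block is 256 bits wide, as on the DMA TRM slide, and that transfers run back to back; it says nothing about the sustained DDR2 bandwidth behind the buffer. The function name is made up.

```python
CLOCK_HZ = 116_000_000   # system clock from the slide
BLOCK_BITS = 256         # assumed block size: width of the DMA data bus

def bandwidth_bytes_per_s(cycles_per_block):
    """Peak rate when one block moves every cycles_per_block cycles."""
    return CLOCK_HZ / cycles_per_block * BLOCK_BITS / 8

load_rate = bandwidth_bytes_per_s(2)    # 2 cycles per load
store_rate = bandwidth_bytes_per_s(1)   # 1 cycle per store
```

Under these assumptions the peak rates are about 1.86 GB/s for loads and 3.71 GB/s for stores.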
References


http://www.nativesystems.inf.ethz.ch/
Reference papers
 Ling Liu, Oleksii Morozov, A Process-Oriented Streaming System Design Paradigm for FPGAs, Reconfig’2010, Cancun, Mexico, December 13-15, 2010.
 Ling Liu, Oleksii Morozov, Yuxing Han, Jürg Gutknecht, Patrick Hunziker, Automatic SoC Design Flow on Manycore Processors: a Software Hardware Co-Design Approach for FPGAs, FPGA’2011, Monterey, California, February 27 - March 1, 2011.
Reserve Slides
Program Model Refinement 2



Separate agent thread for each communication
Each actor running one main thread (behavior) and several communication threads (agents) under mutual exclusion
Advantages
 Stateful dialogs
 No deadlocks
Requirements
 Fast context switches
[Figure labels: communication, behavior, c, X, Y]
Wiring Integrated into Actors
module M;
var x1, x2: X;
y: Y;
type
X = object … end X;
Y = object … end Y;
begin
new(y);
new(x1, y);
new (x2, y)
end M.
X = object
var c: Y.C;
activity A;
var i, j, k: integer;
begin (*behave*)
…; c(i, j); …; c(k); …
end A;
procedure X (y: Y);
begin (*build object*)
…; new (c); …
end X;
begin new A
(*launch behavior*)
end X;
Y = object
activity A;
begin (*behave*) …
end A;
activity C;
var u, v, w: integer;
begin (*communicate*)
…; accept(u, v);
…; accept(w); …
end C;
procedure Y;
begin (*construct*) …
end Y;
begin new A
end Y;