Polymorphic Processors:
How to Expose Arbitrary Hardware Functionality to Programmers
Stamatis Vassiliadis
Computer Engineering,
EEMCS, TU Delft
http://ce.et.tudelft.nl
Member of HiPEAC
PACT ’04, Antibes, France
PZE and Amdahl’s law
Potential Zero Execution (PZE) was introduced in 1987-88 and published in the IBM Journal of R&D in 1994.
Example: timewise we execute two instructions (50% code elimination)
• Max speedup = 2.0 (excluding start-up)
• Reduced 5 cycles to 3: speedup 5/3 ≈ 1.67, i.e. 83% efficiency
The limitation: techniques
• ILP
• pipeline
• technology
[Figure: Amdahl’s curve. Labels: program, 50%, …, 20%; B; ASIC; Very Large; 2X at 0.5; 10X at 0.9]
Why polymorphic? We can ride Amdahl’s curve more easily and faster
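For reference, the curve in question can be written out as below. This is the standard statement of Amdahl's law rather than something taken from the slide; the symbols a (fraction of execution time that is accelerated) and s (speedup of that fraction) are notation assumed here, chosen to match the a and Smax labels used later in the talk.

```latex
S(a, s) = \frac{1}{(1 - a) + \dfrac{a}{s}},
\qquad
S_{\max}(a) = \lim_{s \to \infty} S(a, s) = \frac{1}{1 - a}

% Points quoted on the slide:
%   a = 0.5  =>  S_max = 2
%   a = 0.9  =>  S_max = 10
% Efficiency of the PZE example: (5/3) / 2 \approx 0.83
```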
Motivating example
Paeth coding (predictive coding)
Coding: Original image -> predictive coding -> Filtered image -> compression (ZIP) -> bitstream
Transmission: bitstream
Decoding: bitstream -> UNZIP -> Filtered image -> decoding -> Original image
Research questions:
• What does Paeth mean in terms of computations?
• Can I put it in hardware?
• What is my gain?

Goal: get an image with more 0’s

Is it possible?: spatial redundancy
(adjacent pixels often have the same
values => many differences
between them = 0)
Motivating example
Pixel neighbourhood (d is the current pixel):
c b
a d
Paeth(d) = the one of a, b, c which is closest
to the initial prediction p = a+b-c
Original      Paeth         Filtered
0 0 0 0       0 0 0 0       0 0 0 0
0 3 3 3       0 0 3 3       0 3 0 0
0 3 4 4       0 3 3 4       0 0 1 0
0 3 4 5       0 3 4 4       0 0 0 0
Filtered = Original - Paeth
Worked pixel:
c=3, b=3
a=4, d=4
p = 4+3-3 = 4
Paeth(d) = a = 4
Filtered = 4 - 4 = 0
Hardware data flow (inputs c, b, a):
p = a+b-c
pa = |p-a|, pb = |p-b|, pc = |p-c|
Comparisons pa<=pb?, pa<=pc?, pb<=pc? select a, b or c as Paeth
area:…………… 6 8-bit adders
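The selection rule above can be sketched in plain C as follows. This is a minimal illustration that follows the Paeth predictor as defined for PNG filtering, not the code from the talk; the names paeth_predict and paeth_filter_row are hypothetical, and treating the left and upper-left neighbours of the first pixel as 0 is an assumption borrowed from the PNG convention.

```c
#include <stdlib.h>

/* Paeth predictor: return whichever of a (left), b (above), c (upper left)
 * is closest to the initial prediction p = a + b - c.
 * Ties are broken in the order a, b, c. */
unsigned char paeth_predict(unsigned char a, unsigned char b, unsigned char c)
{
    int p  = (int)a + (int)b - (int)c;
    int pa = abs(p - (int)a);
    int pb = abs(p - (int)b);
    int pc = abs(p - (int)c);

    if (pa <= pb && pa <= pc)
        return a;
    if (pb <= pc)
        return b;
    return c;
}

/* Filter one row: filtered[i] = curr[i] - Paeth(a, b, c), modulo 256.
 * Assumption: neighbours to the left of the first pixel count as 0. */
void paeth_filter_row(const unsigned char *prev_row,
                      const unsigned char *curr_row,
                      unsigned char *filtered_row,
                      int length)
{
    for (int i = 0; i < length; i++) {
        unsigned char a = (i > 0) ? curr_row[i - 1] : 0;  /* left        */
        unsigned char b = prev_row[i];                    /* above       */
        unsigned char c = (i > 0) ? prev_row[i - 1] : 0;  /* upper left  */
        filtered_row[i] = (unsigned char)(curr_row[i] - paeth_predict(a, b, c));
    }
}
```

Applied to the second and third rows of the example image (0 3 3 3 above 0 3 4 4), this yields the filtered row 0 0 1 0 shown on the slide.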
Example: Paeth Prediction (PNG)
C-code
bptr = prev_row+1;
dptr = curr_row+1;
predptr= predict_row+1;
for(i=1; i < length; i++){
c = *(bptr-1);
b = *bptr;
... .... ...
if(...)
*predptr = a;
else if (..)
else *predptr = c;
......
bptr++;
}
What it does: initialize, load, unpack, process, pack, store, loop
Altivec code
# Initialize
li r5, 0
…. totally 6 instructions
loop:
# Load (c's, a's, b's)
lvx vr03, r1
lvx vr04, r2
vsldoi vr05, vr01, vr03, 1
# Unpack
vmrghb vr07, vr03, vr00
vmrglb vr08, vr03, vr00
… totally 6 instructions
# Compute (a+b, …)
vadduhs vr15, vr09, vr11
vadduhs vr16, vr10, vr12
vsubshs vr15, vr15, vr07
vsubshs vr16, vr16, vr08
.. totally 76 instructions
# Pack
vpkshus vr28, vr28, 29
# Store
stvx vr28, r3, 0
# Loop control
addi r1, r1, 16
……..
bneq r7, r0, loop
CSI code
li r5, 1
csi_mt_scr r1, SCR1, 0
csi_mt_scr r5, SCR1, 1
.. totally 20 instructions
csi_paeth predptr, bptr, dptr   # ONE INSTRUCTION for all loop iterations: load, unpack, process, pack, store, looping
Altivec iteration: 95 instructions per 16 pixels.
CSI code: 1 instruction for all iterations (+20 setup instructions)
CSI instruction design ( EUROMICRO 99 ):
latency: …………. 5 cycles
throughput: ……… 16 pixels/1 cycle
area: …………… 24 32-bit adders
Cycle = 1 ALU operation

Results: Instruction count and
execution time reduction
Bench: Paeth kernel, 132-element vectors (132 pixels in a row)
Dynamic instruction counts,
normalised to non-CSI counts
Execution time: on 4-issue CPU,
with 32 byte-wide CSI unit,
normalised to non-CSI execution
Research Questions
Motivating example: Obvious observations
NO way I can do this on fixed hardware
I can do this if the hardware changes functionality at my wishes.
EASIER SAID THAN DONE !
I have to answer the following:
How can I identify the code for hardware implementation?
New kind of tools
How can I implement “arbitrary” code?
Microarchitecture
Is the hardwired code substituted by new instructions?
Processor architecture
(behavior + logical structure)
How can I substitute this code with SW/HW descriptions say at the source level?
Programming paradigm
(HW and SW descriptions coexisting in a program)
How can I automatically generate the “transformed” program?
Compilation
Outline
[Figure: Program P is transformed into Program P’ (code A and DATA); GPP and RH (FPGA) share MEM]
What to do:
– Identify the “α” code
– Show hardware feasibility of “α” in FPGA
– Map “α” into reconfigurable hardware (RH)
– Eliminate the identified code
– Add code to have “equivalent” behavior
– Compile new program
– Execute
Introduce reconfigurable microcode (ρμ-code)
Specific code in hardware left to the programmer/hardware designer
RESULTS
Tools
Microarchitecture
Architecture
Programming Paradigm
Compiler
MOLEN
One time 8 new instructions for any ISA
Co-processor paradigm (e.g. vector)
New register file for parameter passing
Sequential consistency
Split-join parallelism
Function like code
Tool Chain
[Figure: tool chain. Labels: Human Directives; Code; C2C; New program where hardware/software descriptions co-exist (call f(.), HDL for f(.)); Architecture; Retargeted Compiler; Binary Code; decision “Critical?” with YES/NO branches; HDL: AUTO, LIB, hand coded; targets: XILINX VIRTEX-II PRO FPGA and IBM PowerPC]
Example code:
…
int fact(int n)
{
if(n<1) return 1;
else
return(n*fact(n-1));
}
…
The MOLEN ISA
Divide RC into two logical phases “SET EXECUTE address”
“function” independent
No new op-codes
Implementation and ISA independent
Reconfigurable design
(two instructions)
Parameter passing: two new instructions + Register file
Arbitrary number of parameter passing
Parallel execution : split via a Molen
instruction and join via a GPP
instruction or one special instruction
Modularity: by implementing at least the
minimal MOLEN instruction set and by
reconfiguring to it.
Execute on reconfigurable
One instruction
Speeding up: reconfiguration
and execution
Two instructions for prefetching
Total: 8
PACT ’04, Antibes, France
new instructions
( SAMOS ‘03 )
Instruction Set Partitioning
8 instructions grouped in 6 instruction categories:
• SET < address >: partial SET (P-SET) and complete SET (C-SET)
• EXECUTE < address >
• SET PREFETCH < address >
• EXECUTE PREFETCH < address >
• BREAK
• MOVTX and MOVFX
Instruction set subsets: Minimal, Preferred, Complete
Sequence Control Example
#pragma call_fpga op1
int f( int x, int y)
{
…
}
#pragma call_fpga op2
int g(int x)
{
…
}
int h(int a, int b, int c)
{
int m,n, ...;
m=f(a, b);
n=g(c); /* no data dependency between f and g */
……
}
h:
mov a -> r1
movtx r1 ->XR2
mov b -> r2
movtx r2 ->XR3
mov c -> r3
movtx r3 -> XR4
set address_set_op1
set address_set_op2
ldc 2 ->r4
movtx r4 ->XR0
ldc 4 ->r5
movtx r5 ->XR1
execute address_ex_op1 # op1 and op2 execute in parallel
execute address_ex_op2
movfx XR2 -> r6
mov r6 -> m
movfx XR4 -> r7
mov r7 -> n
Reconfigurable Microcode Storage ( IEEE MICRO ‘03 )
MicroProgram on-chip storage:
• FIXED: fixed on-chip storage for frequently used microcode (permanently stored)
• PAGEABLE: pageable on-chip storage for less frequently used microcode (loaded from memory)
The ρμ-code unit
[Figure: the ρμ-code unit. Labels: residence table (R/P, α, ρCS-α, H); ρCS-α (fixed) or ρCS-α, if present (pageable); SEQUENCER (determines the next microinstruction, with input from the execution hardware = reconfigurable unit (CCU)); ρCSAR; ρ-CONTROL STORE with set and execute sections, each FIXED and PAGEABLE; MIR (microinstruction sent to the execution hardware = reconfigurable unit (CCU))]
More on Architectural support
Instruction format (instruction word): OPC, R/P, address
• R/P: Resident (0); Pageable (1)
• address: Control Store address (ρCS-α); Memory address (α)
An example microprogram:
• located in memory starting at address α
• address α points to the first microinstruction
• terminated by an end_op
00: load values into adder
01: shift_ins
02: add_ins
03: shift2_ins
04: SKIP
05: BACK
06: store
07: end_op
The MOLEN ρμ-coded processor (FPL’01)
[Figure: Molen machine organization. Labels: Main Memory; Instruction Fetch; Data Load/Store; ARBITER; DATA MEMORY MUX/DEMUX; Core Processor with Register File; Exchange Registers; Reconfigurable Processor (reconfigurable microcode unit and CCU)]
• The arbiter arbitrates (redirects) instructions between GPP and RP
• The arbiter also controls the loading of microcode
• X registers exchange parameters between GPP and RU
• The ρμ-code unit controls the CCU by microinstructions
• The CCU has direct access to the data memory
The Molen Prototype
Molen machine
organization
Molen prototype
implemented on
Virtex II Pro
The Prototype Features
A VHDL model has been synthesized for Virtex II Pro technology
• 64KBytes data and 64KBytes instructions (on-chip) mems;
• 64-bit data memory bus;
• 64-bit instruction memory bus;
• 64 bits microcode word length;
• 32 MBytes memory segment for microprograms;
• 8Kx64-bit ρ-control store using Dual Port Block RAMs (BRAM);
• 512x32-bit XREGs implemented in BRAMs.
Three clock domains:
• PPC clock – 250MHz;
• MEM clock – 83 MHz;
• User clock – external.
Trivial HW costs ( FCCM 04 )
Utilization of FPGA resources (no CCU), device xc2vp20-5:
                    Reconf. Processor   Arbiter   Total incl. XREGs   Available resources   %
# slices            71                  84        156                 10304                 1
# flip-flops        84                  69        147                 20608                 1
# LUT4              171                 150       322                 20608                 1
# BRAM              4                   N.A.      5                   112                   3
Max. Freq. [MHz]    130                 143       130                 N.A.                  N.A.
Compiling for the Molen ( FCCM )
[Figure: compiler flow. C application (MAIN.c … File_n.c) -> SUIF frontend -> Machine SUIF backend framework, with alpha backend, x86 backend and the MOLEN extension]
The Molen Compiler ( FPL 03-04 )
• IBM PowerPC 405 GPP in Virtex II Pro
• Register file extension (XRs)
• ISA extension (SET/EXEC)
SUIF + MachineSUIF with the Molen extension and a PowerPC backend
Code for a “function”
• Example:
C code: res = alpha(param1, param2);
movtx XR1 ← param1           # send parameters
movtx XR2 ← param2
set <address_alpha_set>      # HW reconfiguration
exec <address_alpha_exec>    # HW execution
movfx res ← XR3              # return result
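The four phases above (send parameters, reconfigure, execute, return result) can be mimicked in software to make the calling convention concrete. The following is only a toy emulation under loud assumptions: the helper names movtx, movfx, molen_set and molen_exec are hypothetical C stand-ins for the Molen instructions, the CCU operation is modelled as a function pointer, and alpha_hw simply adds its two parameters because the slide does not say what alpha computes.

```c
#include <stddef.h>

#define NUM_XREGS 512                 /* the prototype slide lists 512 XREGs */

static int xreg[NUM_XREGS];           /* exchange registers (XRs) */

typedef void (*ccu_op)(void);         /* stand-in for one CCU operation */
static ccu_op configured_op = NULL;   /* operation currently "configured" */

/* movtx: copy a GPP value into an exchange register */
static void movtx(size_t xr, int value) { xreg[xr] = value; }

/* movfx: copy an exchange register back to the GPP */
static int movfx(size_t xr) { return xreg[xr]; }

/* set: model reconfiguration as selecting the operation */
static void molen_set(ccu_op op) { configured_op = op; }

/* exec: run the operation only if it has been configured; 0 on success */
static int molen_exec(ccu_op op)
{
    if (configured_op != op)
        return -1;
    op();
    return 0;
}

/* Hypothetical hardware body of alpha: reads XR1 and XR2, writes XR3. */
static void alpha_hw(void) { xreg[3] = xreg[1] + xreg[2]; }

/* Software view of: res = alpha(param1, param2); */
int call_alpha(int param1, int param2)
{
    movtx(1, param1);                 /* send parameters    */
    movtx(2, param2);
    molen_set(alpha_hw);              /* HW reconfiguration */
    molen_exec(alpha_hw);             /* HW execution       */
    return movfx(3);                  /* return result      */
}
```

The point of the sketch is the ordering: parameters travel only through the exchange registers, and the set phase must precede the execute phase for the same operation.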
Sequence Control Example
Code generation:
C code
#pragma call_fpga op1
int f(int a, int b){
int c,i;
c=0;
for(i=0; i<b; i++)
c = c + a<<i + i;
c = c>>b;
return c;
}
void main(){
int x,z;
z=5;
x= f(z, 7);
}
Original code
main:
mrk 2, 13
ldc $vr0.s32 <- 5
mov main.z <- $vr0.s32
mrk 2, 14
ldc $vr2.s32 <- 7
cal $vr1.s32 <- f(main.z, $vr2.s32)
mov main.x <- $vr1.s32
mrk 2, 15
ldc $vr3.s32 <- 0
ret $vr3.s32
.text_end main
Modified code
mrk 2, 14
mov $vr2.s32 <- main.z
movtx $vr1.s32(XR) <- $vr2.s32
ldc $vr4.s32 <- 7
movtx $vr3.s32(XR) <- $vr4.s32
set address_op1_SET
ldc $vr6.s32(XR) <- 0
movtx $vr7.s32(XR) <- vr6.s32
exec address_op1_EXEC
movfx $vr8.s32 <- $vr5.s32(XR)
mov main.x <- $vr8.s32
The Experiment (hand tuned HW)
Step 1. Obtain MPEG-2 profiling data on a PowerPC system
MPEG-2 encoder:
sequence    #frames@Resolution   SAD (16x16)   DCT (8x8)   IDCT (8x8)   Total
carphone    96@176x144           51.1 %        12.5 %      1.3 %        64.9 %
claire      168@360x288          53.8 %        11.8 %      1.0 %        66.6 %
container   300@352x288          56.2 %        10.7 %      1.0 %        67.9 %
tennis      112@352x240          60.0 %        9.5 %       0.8 %        70.3 %
MPEG-2 decoder:
sequence    IDCT (8x8)
carphone    50.4 %
claire      37.6 %
container   40.4 %
tennis      40.5 %
Step 2. Measure the kernels speedups on the prototype:
sequence    SAD16   SAD128   SAD256   DCT     IDCT
carphone    6.5     18.9     22.2     302.3   24.4
claire      8.3     23.9     28.2     302.2   24.4
container   12.2    35.2     41.5     302.1   24.4
tennis      12.1    35.0     41.2     302.1   32.3
Step 3. Overall speedup per kernel
MPEG-2 encoder:
sequence    SAD16   SAD128   SAD256   DCT    IDCT
carphone    1.76    1.94     1.95     1.14   1.01
claire      1.90    2.06     2.08     1.13   1.01
container   2.07    2.20     2.21     1.12   1.01
tennis      2.22    2.40     2.41     1.10   1.01
MPEG-2 decoder:
sequence    IDCT
carphone    1.94
claire      1.56
container   1.63
tennis      1.65
Real vs. Theoretical Speedups
Step 4. Application speedup
            MPEG-2 Encoder                  MPEG-2 Decoder
sequence    Prototype   theory   %Smax      Prototype   theory   %Smax
carphone    2.64        2.85     93         1.94        2.02     96
claire      2.80        2.99     94         1.56        1.60     98
container   2.96        3.12     95         1.63        1.68     97
tennis      3.18        3.37     94         1.65        1.68     98
Performance gain
The MOLEN prototype speeds up the MPEG-2 codec to between 93% and 98% of the theoretically maximum attainable speedups.
Recall Smax: the theoretically attainable maximum assumes that the kernels implemented in reconfigurable hardware (SAD, DCT) take zero time.
[Figure: MPEG-2 encoder speedup (axis 2.6 to 3.2) versus accelerated fraction a (0.65, 0.67, 0.68, 0.71), comparing the theoretically attainable maximum with the experimentally measured values; time-bar labels: T, a, TSE, TSEi]
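The relation between the profile fractions, the kernel speedups and the application speedup can be sketched in a few lines of C. This is an illustration based on Amdahl's law extended to several kernels, not the authors' measurement code; the function names are hypothetical. Plugging in the carphone encoder profile (SAD 51.1%, DCT 12.5%, IDCT 1.3%) with the SAD128, DCT and IDCT kernel speedups gives about 2.64, in line with the prototype column, and a zero-time bound of about 2.85.

```c
#include <stddef.h>

/* Overall speedup when kernel i takes fraction frac[i] of the original
 * execution time and is accelerated kernel_speedup[i] times. */
double overall_speedup(const double *frac, const double *kernel_speedup, size_t n)
{
    double remaining   = 1.0;  /* fraction of time that stays in software   */
    double accelerated = 0.0;  /* time still spent in the accelerated parts */

    for (size_t i = 0; i < n; i++) {
        remaining   -= frac[i];
        accelerated += frac[i] / kernel_speedup[i];
    }
    return 1.0 / (remaining + accelerated);
}

/* Theoretical maximum: the accelerated kernels take zero time. */
double max_speedup(const double *frac, size_t n)
{
    double remaining = 1.0;

    for (size_t i = 0; i < n; i++)
        remaining -= frac[i];
    return 1.0 / remaining;
}

/* Percentage of the theoretical maximum that was actually reached. */
double efficiency_percent(double measured, double theoretical)
{
    return 100.0 * measured / theoretical;
}
```

The gap between the two functions comes only from the time the accelerated kernels still consume, which is why faster kernels push the prototype closer to the bound.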
mpeg2enc Instruction Counts
[Figure: mpeg2enc dynamic counts of instructions, branches, loads and stores for four configurations (default, DCT, SAD, DCT & SAD), normalised to default = 100. Bar labels: 100, 92.2, 91.5, 91.4, 86.4, 61.7, 54, 48, 43.6, 35.1, 33.7. Annotations: 137 million (default), 46 million (DCT & SAD)]
M-JPEG (HW Automatically Generated) ( FPL 04 )
• M-JPEG multimedia benchmark
• DCT * hardware implementation
• Molen prototype
Performance
[Figure: execution cycles (millions) for the test images Tennis, Barbara and Artemis; bars labelled 34 (SW execution) and 14 (HW execution), speedup 2.5]
MJPEG (cycles):
SW DCT (%)            66 %
SW DCT                1,242,017
HW DCT                4,125
HW DCT conv           102,589
Prototype speedup     2.5 x
Theoretical speedup   2.96 x
Efficiency            84 %
Conclusions
• We have shown a new:
• microarchitecture
• processor architecture
• programming paradigm
• compilation
• We have shown that it is easier and faster to
ride Amdahl’s curve with polymorphic
processors!
Contact information
Computer Engineering Laboratory:
http://ce.et.tudelft.nl
MOLEN homepage:
http://ce.et.tudelft.nl/MOLEN
Personal homepage:
http://ce.et.tudelft.nl/~stamatis
Overview paper:
The Molen Polymorphic Processor,
IEEE Transactions on Computers, November 2004