SysteMorph: An SoC Framework for Adaptive Dynamic


An Integrated Temporal Partitioning and Mapping Framework for Handling Custom Instructions on a Reconfigurable Functional Unit
Farhad Mehdipour†, Hamid Noori††, Morteza Saheb Zamani†, Kazuaki Murakami††, Mehdi Sedighi†, Koji Inoue††
†Computer and IT Engineering Department, Amirkabir University of Technology
{mehdipur,szamani,msedighi}@aut.ac.ir
††Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University
[email protected], {murakami,inoue}@i.kyushu-u.ac.jp
Agenda
- Introduction
- General overview of the architecture
- Generating Custom Instructions
- Reconfigurable Functional Unit (RFU)
- Tool Chain used for our quantitative approach
- Integrated Temporal Partitioning and Mapping
  - The Integrated Framework
  - Incremental Temporal Partitioning Algorithm
  - Mapping Procedure
- Experimental Results
Kyushu University
ACSAC 2006 - Shanghai, China
Introduction
Approaches for designing embedded SoCs
- Application Specific Integrated Circuits (ASICs)
  - Higher performance
  - Lower power consumption
  - Not flexible
  - Expensive and time-consuming design process
- General Purpose Processors (GPPs)
  - Availability of tools
  - Programmability
  - Low performance
  - High power consumption
- Application Specific Instruction-set Processors (ASIPs)
  - More flexible than ASICs
  - Higher performance than GPPs
  - Long and costly design and verification
- Extensible Processors
  - More flexibility
  - Significant non-recurring engineering costs
General overview of the architecture
Adaptive Dynamic Extensible Processor
- Base Processor: N-way in-order general RISC
  - Pipeline stages: Fetch, Decode, Execute, Memory, Write
  - Reg File
- Augmented Hardware
  - Profiler: detects start addresses of Hot Basic Blocks (HBBs)
  - RFU: executes Custom Instructions
  - Sequencer: switches between main processor and RFU
Operation modes
- Training Mode
  - Binary-level profiling of the applications: the profiler detects the start addresses of HBBs
  - Running tools for generating custom instructions, generating configuration data for the RFU and initializing the sequencer table
  - Binary rewriting
- Normal Mode
  - The sequencer monitors the PC and switches between the main processor and the RFU
  - The RFU executes the CIs
Integrating base processor with other components
[Figure: block diagram. GPP: Register File, ID/EXE Reg, Functional Unit, MUX, EXE/MEM Reg. Augmented HW: RFU, Configuration Memory, Profiler, Profiler Table, Sequencer, Sequencer Table. Online Training.]
Generation of Custom Instructions
- Custom instructions
  - Limited to one Hot Basic Block (HBB)
  - Exclude floating point, multiply, divide and load instructions
  - Include at most one STORE, at most one BRANCH/JUMP and all other fixed point instructions
- Simple algorithm for generating custom instructions
  - HBBs usually include 10~40 instructions for MiBench
  - The custom instruction generator runs on the base processor (in online training mode)
Generating Custom Instructions
Address  Instruction
4052c0   addiu  $29,$29,-32
4052c8   mov.d  $f0,$f12
4052d0   sw     $18,24($29)
4052d8   addu   $18,$0,$6
4052e0   sw     $31,28($29)
4052e8   sw     $16,16($29)
4052f0   mfc1   $16,$f0
4052f8   mfc1   $17,$f1
405300   srl    $6,$17,0x14
405308   andi   $6,$6,2047
405310   sltiu  $2,$6,2047
405318   addu   $6,$6,$18
405320   sltiu  $2,$6,2047
405328   lui    $2,32783
405330   and    $17,$17,$2
405338   andi   $2,$6,2047
405340   sll    $2,$2,0x14
405348   or     $17,$17,$2
405350   mtc1   $16,$f0
405358   mtc1   $17,$f1
405360   lw     $31,28($29)
405370   lw     $16,16($29)
405378   addiu  $29,$29,32
405380   jr     $31






- Finding the biggest sequence of instructions in the HBB that can be executed on the ACC
- Moving instructions and appending supportable instructions to the head of the detected instruction sequence after checking flow-dependency and anti-dependency
- Moving instructions and appending supportable instructions to the tail of the detected instruction sequence after checking flow-dependency and anti-dependency
- Rewriting the object code if instructions have been moved
- Moving instructions must not modify the logic of the application
- Custom instruction generation is done without considering any other constraints.
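The first step of the procedure (finding the biggest supportable sequence in an HBB) can be sketched as a sliding window over the block's opcodes. This is a minimal sketch, not the authors' tool: the opcode classes and the names `is_supported` and `longest_ci_window` are assumptions made for illustration, and the later steps (moving instructions to the head or tail after dependency checks) are not modeled.

```python
# Opcode classes are illustrative assumptions for a MIPS-like ISA;
# the slides only say which kinds of instruction are excluded.
LOADS = {"lw", "lh", "lhu", "lb", "lbu"}
STORES = {"sw", "sh", "sb"}
BRANCHES = {"beq", "bne", "blez", "bgtz", "bltz", "bgez", "j", "jal", "jr", "jalr"}
MUL_DIV = {"mult", "multu", "div", "divu"}
FP_MOVES = {"mfc1", "mtc1", "lwc1", "swc1"}


def is_supported(op):
    """True if the opcode may be part of a custom instruction."""
    if "." in op:  # floating point mnemonics such as mov.d
        return False
    return op not in LOADS | MUL_DIV | FP_MOVES


def longest_ci_window(ops):
    """Return (start, end) of the longest contiguous run of supported
    opcodes that holds at most one store and at most one branch/jump."""
    best = (0, 0)
    start = stores = branches = 0
    for end, op in enumerate(ops):
        if not is_supported(op):
            # An unsupported opcode ends the current run.
            start, stores, branches = end + 1, 0, 0
            continue
        stores += op in STORES
        branches += op in BRANCHES
        while stores > 1 or branches > 1:
            # Shrink from the left until the limits hold again.
            stores -= ops[start] in STORES
            branches -= ops[start] in BRANCHES
            start += 1
        if end + 1 - start > best[1] - best[0]:
            best = (start, end + 1)
    return best


# Opcodes of the HBB shown on this slide, in address order.
HBB = ["addiu", "mov.d", "sw", "addu", "sw", "sw", "mfc1", "mfc1",
       "srl", "andi", "sltiu", "addu", "sltiu", "lui", "and", "andi",
       "sll", "or", "mtc1", "mtc1", "lw", "lw", "addiu", "jr"]

start, end = longest_ci_window(HBB)
print(HBB[start:end])
```

For the block above, the window covers the ten instructions from address 405300 (srl) to 405348 (or), which lie between the two groups of floating point moves.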
Reconfigurable Functional Unit (RFU)
- RFU is a matrix of Functional Units (FUs)
- RFU has a configuration memory
- FUs support only logical operations, add/subtract, shifts and compare
- RFU updates the PC after executing each CI
- RFU has a variable delay which depends on the depth of the DFG of the Custom Instruction
RFU Architecture: A Quantitative Approach
- 22 programs of MiBench were chosen
- The SimpleScalar toolset was utilized for simulation
- RFU is a matrix of FUs
  - Number of inputs
  - Number of outputs
  - Number of FUs
  - Width
  - Depth
  - Connections
  - Location of inputs and outputs
- Coverage (mapping) rate: percentage of generated CIs that can be mapped on the RFU considering constraints
- Considering frequency and weight in measurement
  - CI execution frequency
  - Weight (to equal the number of executed instructions)
  - Average = Σ over all CIs of (Freq × Weight)
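One way to read the frequency-and-weight measurement is as a weighted coverage rate, where each CI counts in proportion to Freq × Weight. The sketch below assumes that reading; the function name `weighted_coverage` and the sample numbers are hypothetical and are not taken from the experiments.

```python
def weighted_coverage(cis):
    """cis: iterable of (frequency, weight, mappable) tuples.

    Returns the percentage of the total Freq * Weight that belongs
    to CIs that can be mapped on the RFU.
    """
    cis = list(cis)
    total = sum(freq * weight for freq, weight, _ in cis)
    mapped = sum(freq * weight for freq, weight, ok in cis if ok)
    return 100.0 * mapped / total if total else 0.0


# Hypothetical CIs: (execution frequency, instructions in the CI, mappable?)
sample = [(100, 5, True), (10, 20, False), (50, 6, True)]
print(weighted_coverage(sample))
```

With these made-up numbers, two of three CIs are mappable, but they account for 800 of the 1000 weighted instruction executions, so the weighted rate differs from the plain count of mappable CIs.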
Tool Chain
[Figure: tool-chain flow. Blocks: base processor (SimpleScalar, PISA configuration); 22 applications of MiBench; Profiler (detecting start addresses of HBBs); reading HBBs from object code; Custom Instruction Generator; generating DFG for HBBs; optimization (constant propagation); updating DFG; mapping CIs on the RFU. Results are used for designing the RFU.]
RFU Inputs (no constraint)
[Figure: "Input No. Analysis - Optimized Version". Coverage (%) versus number of inputs (1 to 19). Labeled values: 89.37, 96.37, 98.48. Annotation: 8.]
RFU Outputs (no constraint)
[Figure: "Output No. Analysis - Optimized Version". Coverage (%) versus number of outputs (1 to 14). Labeled value: 96.58. Annotation: 6.]
RFU Architecture
- Distributing inputs in different rows
  - Row 1 = 7
  - Row 2 = 2
  - Row 3 = 2
  - Row 4 = 2
  - Row 5 = 1
- Connections with variable length
  - row1 → row3 = 1
  - row1 → row4 = 1
  - row1 → row5 = 1
  - row2 → row4 = 1
- Synthesis results using Hitachi 0.18 μm
  - Area: 1.1534 mm²
  - Delay: 9.66 ns
Generating Custom Instruction for the Target RFU
- In our primary CI generator we did not consider any constraints for the generated CIs and tried to generate CIs as large as possible.
- Therefore, once the architecture was fixed, some of the generated CIs could not be mapped on the proposed RFU due to its constraints.
Customizing CI generator for the Target RFU – First Approach (CIGen)
- Some primary constraints of the RFU (number of inputs, number of outputs and number of nodes) were added to our CI generator tool to generate CIs that are mappable.
- In this approach the CI generator is unaware of the mapping process results.
- Some CIs may ultimately not be mapped to the RFU due to the routing and connection constraints.
[Figure: flowchart. RFU architectural constraints → CI generation tool → CIs → "Mapping is successful?" YES: final configurations. NO: rejected CIs (have to be run on the base processor).]
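The CIGen flow on this slide (primary constraints first, then a mapping attempt that may still fail) can be sketched as follows. Everything named here is hypothetical: the `CI` and `Limits` records, the `classify` function, the limit values, and the stub mapper stand in for the real generator and mapping tool, whose interfaces the slides do not describe.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class CI:
    name: str
    inputs: int
    outputs: int
    nodes: int


@dataclass(frozen=True)
class Limits:
    max_inputs: int
    max_outputs: int
    max_nodes: int


def meets_primary_constraints(ci: CI, lim: Limits) -> bool:
    """Check only the constraints CIGen knows about."""
    return (ci.inputs <= lim.max_inputs
            and ci.outputs <= lim.max_outputs
            and ci.nodes <= lim.max_nodes)


def classify(cis: Iterable[CI], lim: Limits,
             try_map: Callable[[CI], bool]) -> Tuple[List[CI], List[CI]]:
    """Split CIs into (accepted, rejected). A CI that passes the primary
    constraints can still be rejected when the mapper fails on routing."""
    accepted, rejected = [], []
    for ci in cis:
        if meets_primary_constraints(ci, lim) and try_map(ci):
            accepted.append(ci)
        else:
            rejected.append(ci)
    return accepted, rejected


# Hypothetical limits and CIs; the stub mapper pretends that "c" fails routing.
limits = Limits(max_inputs=8, max_outputs=6, max_nodes=16)
cis = [CI("a", 4, 2, 10), CI("b", 9, 1, 5), CI("c", 8, 6, 16)]
accepted, rejected = classify(cis, limits, try_map=lambda ci: ci.name != "c")
print([c.name for c in accepted], [c.name for c in rejected])
```

In this sketch "b" is rejected by the primary constraints, while "c" satisfies them and is rejected only at mapping time, which is the case that rejected CIs on this slide illustrate.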
Customizing CI generator for the Target RFU – Second Approach
- Integrated Framework
  - Performs an integrated temporal partitioning and mapping process
  - Takes rejected CIs as input
  - Partitions them into appropriate mappable CIs
- Advantages
  - All generated CIs are mappable
  - Uses a mapping-aware temporal partitioning process
[Figure: flowchart. Blocks: CIs generated by the CI generation tool; mapping on RFU; "Mapping is successful?" (YES/NO); Integrated Framework: temporal partitioning, temporal partitions, incremental temporal partitioning, incremental temporal partitions; final configurations.]
Integrated Framework - Incremental Temporal Partitioning Algorithm
- Incremental Temporal Partitioning
  - The node with the highest ASAP level is selected and moved to the subsequent partition.
  - Node selection and moving order: 15, 13, 11, 9, 14, 12, 10, 8, 3 and 7.
[Figure: data flow graph of the input CI (nodes 0 to 15), the resulting 1st partition (nodes 0, 1, 2, 4, 5, 6) and 2nd partition (nodes 3 and 7 to 15), and the RFU map of each partition.]
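The incremental step described above can be sketched as a loop that keeps moving the node with the highest ASAP level into the next partition until the current partition maps. This is a sketch under assumptions, not the authors' implementation: `can_map` is a hypothetical stand-in for the real mapping attempt, ASAP levels start at 0 for source nodes, and ties are broken by the larger node id, which need not match the order used in the slide's example.

```python
def asap_levels(preds):
    """preds: dict node -> list of predecessor nodes.
    Returns node -> ASAP level, with source nodes at level 0."""
    levels = {}

    def level(n):
        if n not in levels:
            levels[n] = 1 + max((level(p) for p in preds[n]), default=-1)
        return levels[n]

    for n in preds:
        level(n)
    return levels


def incremental_partition(preds, can_map):
    """Split the DFG into an ordered list of mappable partitions."""
    levels = asap_levels(preds)
    remaining = set(preds)
    partitions = []
    while remaining:
        current, moved = set(remaining), set()
        while current and not can_map(current):
            # Successors always have a higher ASAP level, so they leave
            # first and no moved node feeds a node that stays behind.
            victim = max(current, key=lambda n: (levels[n], n))
            current.remove(victim)
            moved.add(victim)
        if not current:
            raise ValueError("a single node could not be mapped")
        partitions.append(current)
        remaining = moved
    return partitions


# Small hypothetical DFG; the stand-in mapper accepts at most 3 nodes.
dfg = {1: [], 2: [1], 3: [2], 4: [3], 5: [2]}
parts = incremental_partition(dfg, can_map=lambda nodes: len(nodes) <= 3)
print(parts)
```

Because each partition is re-checked with the mapper after every move, the partitioning is driven by mapping results rather than by a fixed size bound.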
Mapping Custom Instructions
- Mapping is the same as the well-known placement problem:
  - Determining the appropriate positions for DFG nodes on the RFU.
  - Assigning CI instructions to FUs is done based on the priority of the nodes.
An Example: Mapping of a CI on the RFU
A Custom Instruction
  1: SUBU R3, R0, R3
  2: ADDU R10, R0, R0
  3: SRA  R8, R10, 0x3
  4: SLT  R2, R3, R8
  5: BNE  R0, 400488, R2
[Figure: data flow graph of the custom instruction (nodes 1 to 5) and the corresponding RFU map.]
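The data flow graph of this example can be derived from register flow dependencies alone. The sketch below is illustrative: the tuple encoding of the instructions and the names `build_dfg` and `node_levels` are assumptions, and only flow (read-after-write) dependencies inside the CI are tracked.

```python
def build_dfg(instrs):
    """instrs: list of (dest_or_None, [source registers]) in program order.
    Returns {index: set of producer indices}, with 1-based indices."""
    last_writer = {}
    preds = {}
    for i, (dest, srcs) in enumerate(instrs, start=1):
        preds[i] = {last_writer[r] for r in srcs if r in last_writer}
        if dest is not None:
            last_writer[dest] = i
    return preds


def node_levels(preds):
    """Level of each node, counting nodes along the longest path to it."""
    level = {}
    for i in sorted(preds):  # program order is a topological order
        level[i] = 1 + max((level[p] for p in preds[i]), default=0)
    return level


# The five instructions of the example CI: (destination, sources).
ci = [
    ("R3", ["R0", "R3"]),   # 1: SUBU R3, R0, R3
    ("R10", ["R0", "R0"]),  # 2: ADDU R10, R0, R0
    ("R8", ["R10"]),        # 3: SRA  R8, R10, 0x3
    ("R2", ["R3", "R8"]),   # 4: SLT  R2, R3, R8
    (None, ["R0", "R2"]),   # 5: BNE  R0, 400488, R2
]

dfg = build_dfg(ci)
levels = node_levels(dfg)
print(dfg)
print(max(levels.values()))
```

Nodes 1 and 2 have no producers inside the CI, node 3 depends on 2, node 4 on 1 and 3, and node 5 on 4, so the longest dependency chain in this CI has four nodes.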
Customizing Mapping Tool
A Custom Instruction
  1: SLL  R2, R17, 0x1
  2: ADDU R2, R2, R17
  3: SLL  R2, R2, 0x3
  4: ADDU R2, R2, R17
  5: SLL  R2, R2, 0x4
  6: ADDU R2, R2, R20
  7: SLL  R3, R18, 0x2
  8: ADDU R3, R3, R2
[Figure: data flow graph of the custom instruction (nodes 1 to 8) with the critical path marked, and the corresponding RFU map.]
Spiral-shaped mapping is possible thanks to the horizontal connections in the third and fourth rows of the RFU.
[Figure: CI lengths for MiBench applications.]
Percentage of rejected CIs for CIGen
[Figure: bar chart. Y-axis: % of rejected CIs (0 to 70). X-axis: bitcounts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame.]
Initial and final number of partitions
[Figure: bar chart with two series, "Initial No. of Partitions" and "Final No. of Partitions". Y-axis: No. of Partitions (0 to 90). X-axis: bitcnts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec), sha.]
Maximum critical path length for CIs
[Figure: bar chart. Y-axis: Maximum Critical Path Length (0 to 8). X-axis: bitcnts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec), sha.]
Performance Evaluation
- SimpleScalar was configured to behave as a MIPS32 4K processor. The base processor supports the MIPS instruction set.
- 22 applications of MiBench

  Parameter         Value
  Issue             1-way
  L1 I-cache        32K, 2 way, 1 cycle latency
  L1 D-cache        32K, 4 way, 1 cycle latency
  Unified L2        1M, 6 cycle latency
  Execution units   1 integer, 1 floating point
  RUU size          64
  Fetch queue size  64
Delay of RFU according to CI length

  CI Length   RFU Delay (ns)
  1           1.38
  2           2.28
  3           3.12
  4           4.89
  5           6.47
  6           7.57
  7           8.65
  8           9.66

Synopsys Tools + Hitachi 0.18 μm
Speedup
[Figure: bar chart with two series, "Speedup using integrated framework" and "Speedup using CI generation tool". Y-axis: Speedup (1 to 2.4). X-axis: bitcounts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec), sha.]
Conclusions
- Proposing a reconfigurable functional unit for an Adaptive Dynamic Extensible Processor using a quantitative approach.
- Developing an integrated framework for partitioning and mapping custom instructions for the proposed RFU.
Thank you for your attention.