Energy Consumption Evaluation of
an Adaptive Extensible Processor
Hamid Noori, Farhad Mehdipour, Maziar Goudarzi,
Seiichiro Yamaguchi, Koji Inoue, and Kazuaki Murakami
Kyushu University
December 2007
Outline
Introduction
General Overview of the Proposed Approach
Multi-Exits Custom Instructions
Energy Consumption Evaluation
Evaluation Results
Conclusions and Future Work
Introduction (1/2)
Embedded processors have to achieve:
Low cost
High performance
Low power or low energy consumption
Key point: How can processors adapt to target applications?
Solution: ASIP w/ Re-configurability
Application specific ISA
Provide custom instructions (CIs)
Implement re-configurable FUs
Introduction (2/2)
Adaptive, extensible processor [DATE’07]
Has a coarse-grain re-configurable functional unit
Supports efficient “Multi-Exits CIs”
Achieves high performance and low cost
Question: How about energy efficiency?
Results: Energy saving
vs. base processor: 42%
vs. single basic-block based CIs: 15%
ADaptive EXtensible processOR (ADEXOR)
Generating and adding CIs AFTER chip fabrication.
[Figure: ADEXOR design flow. Phases: Design Phase, Chip Fabrication, Configuration Phase, Normal Phase (utilization phase). Other labels: generating the RFU (Section 4.4); synthesis, verification, layout, etc.; testbench applications; target application; tool chain; new object code; configuration bits; base processor; RFU; configuration memory.]
[Figure: ADEXOR block diagram. Labels: Instruction Dispatcher; functional units +, &, x, LD/ST, CFU1; CRFU; Config Mem; Register File.]
Execution Overview of ADEXOR
address  inst.   operands
400680   subiu   $25,$25,1
400688   lbu     $13,0($7)
400690   lbu     $2,0($4)
400698   sll     $2,$2,0x18
4006a0   sra     $14,$2,0x18
4006a8   addiu   $4,$4,1
4006b0   srl     $8,$2,0x1c
4006b8   sll     $2,$8,0x2
4006c0   addu    $2,$2,$25
4006c8   bgez    $10,4006f0
4006d0   xori    $13,$13,1
4006d8   addu    $10,$10,$2
400680   subiu   $25,$25,1
400698   sll     $2,$2,0x18
4006a0   sra     $14,$2,0x18
400688   lbu     $13,0($7)
4006e0   bgez    $10,4006f0
...
[Figure: execution of a hot basic block on the GPP and on the augmented hardware. Labels: Register File; ID/EXE Reg; ALU; MUX; CRFU; RFU configuration memory (indexed by mtc1 or sequencer); Counter (triggered by mtc1 or sequencer); EXE/MEM Reg.]
GPP: General Purpose Processor
RFU: Reconfigurable Functional Unit
Integrating the CRFU and the Base Processor
[Figure: datapath integration of the CRFU with the base processor. Labels: register file (Reg0 ... Reg31); DEC/EXE pipeline registers (from decode stage); ALU1 to ALU4; CRFU input registers; CRFU; configuration memory; counters with enable (triggered by mtc1 or sequencer); EXE/MEM pipeline registers; result bus.]
Microarchitecture of the CRFU
[Figure: CRFU microarchitecture. Labels: CRFU input ports; connections from input ports to inputs of the rows; Row1 to Row5 of FUs, each with configuration bits; outputs of 1st row to the inputs of 3rd, 4th and 5th rows; outputs of 2nd row to the inputs of 4th and 5th rows; CRFU output ports.]
Why Multi-Exits Custom Instructions (MECIs)?
Example: control-flow graph of adpcm, with basic blocks BB1 to BB6 and branch edges annotated 95% and 5%.
Assume at most 20 nodes can be included in one CI.
Conventional BB-based CI generation (single-entry, single-exit): #Required nodes: 4
BB-based CI w/ conditional execution support (single-entry, single-exit): #Required nodes: 22 (cannot be mapped)
Multiple-exits custom instruction (conditional execution + hot-path selection): #Required nodes: 17
[Figure: adpcm control-flow graph with nodes numbered 0 to 30 and branches bgez, beq, bne; in the MECI case the 5% edges are marked as exits.]
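The three node counts above can be illustrated with a small sketch. The per-block node counts below are hypothetical: the slides give only the totals (4, 22, 17), so the sizes were picked to reproduce those totals and do not reflect the real adpcm graph. The helper names are illustrative and are not the authors' tool.

```python
# Hypothetical control-flow graph: block -> (node count, [(successor, probability)]).
# Block sizes are invented so that the totals match the slide (4, 22, 17).
CFG = {
    "BB1": (4, [("BB2", 0.95), ("BB3", 0.05)]),
    "BB2": (6, [("BB4", 0.95), ("BB5", 0.05)]),
    "BB3": (2, []),
    "BB4": (7, []),
    "BB5": (3, []),
}
MAX_NODES = 20  # CI capacity assumed on the slide


def single_block_nodes(cfg, entry):
    """Conventional BB-based CI: only the entry block is covered."""
    return cfg[entry][0]


def all_paths_nodes(cfg, entry):
    """Conditional execution only: every reachable block must be included."""
    seen, stack = set(), [entry]
    while stack:
        bb = stack.pop()
        if bb in seen:
            continue
        seen.add(bb)
        stack.extend(succ for succ, _ in cfg[bb][1])
    return sum(cfg[bb][0] for bb in seen)


def hot_path_nodes(cfg, entry, max_nodes):
    """Hot-path selection: follow the most probable successor while the
    node budget allows; every edge left behind becomes an exit."""
    total, path, bb = 0, [], entry
    while bb is not None and total + cfg[bb][0] <= max_nodes:
        total += cfg[bb][0]
        path.append(bb)
        succs = cfg[bb][1]
        bb = max(succs, key=lambda s: s[1])[0] if succs else None
    return total, path
```

With these made-up sizes, covering every path exceeds the 20-node limit, while the hot path alone fits.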
Main features of MECIs
Fixed point operations √
Multiply x
Divide x
Control flow √
Memory instructions x
Custom Instruction Invocation
How to change the execution sequence and run custom instructions on the CRFU?
Software (mtc1-like instruction) method
Hardware (table look-up) method
Software method
[Figure: MECI graph headed by mtc1, with nodes 0, 2, 3, 5, 7, 8, 10, 11, 12, 19, 20, branches bgez, beq, bne, and exit1 to exit4.]
Code before generating MECI
inst. #  address  inst.   operands (dest, src1, src2)
0        400410   addu    R13 R0 R0
1        400418   lw      R23 100 R2
2        400420   addiu   R4 R4 2
3        400428   subu    R3 R2 R11
4        400430   bgez    400440 R3
5        400438   addiu   R13 R0 8
6        400440   beq     400468 R13
7        400448   subu    R3 R0 R3
8        400450   addu    R10 R0 R0
9        400458   lw      R8 R9 0x3
10       400460   slt     R2 R3 R9
11       400468   addu    R8 R8 R9
12       400470   ori     R10 R10 1
13       400478   bne     4004a8 R2
14       400480   addiu   R10 R0 4
15       400488   subu    R3 R3 R9
16       400490   addu    R8 R8 R9
17       400498   sra     R9 R9 0x1
18       4004a0   slt     R2 R3 R9
19       4004a8   ori     R10 R10 2
20       4004b0   subu    R3 R3 R9
21       4004b8   bne     400410 R2
22       4004c0   slt     R2 R3 R9
Instruction scheduling
Code after generating MECI
inst. #  address  inst.   operands (dest, src1, src2)
1        400410   lw      R23 100 R2
0        400418   addu    R13 R0 R0
-        -        mtc1    #CI (overlaid on the slide; address not shown)
2        400420   addiu   R4 R4 2
3        400428   subu    R3 R2 R11
4        400430   bgez    400440 R3
5        400438   addiu   R13 R0 8
6        400440   beq     400468 R13
7        400448   subu    R3 R0 R3
8        400450   addu    R10 R0 R0
10       400458   slt     R2 R3 R9
9        400460   lw      R8 R9 0x3
11       400468   addu    R8 R8 R9
12       400470   ori     R10 R10 1
13       400478   bne     4004a8 R2
14       400480   addiu   R10 R0 4
15       400488   subu    R3 R3 R9
16       400490   addu    R8 R8 R9
17       400498   sra     R9 R9 0x1
18       4004a0   slt     R2 R3 R9
19       4004a8   ori     R10 R10 2
20       4004b0   subu    R3 R3 R9
21       4004b8   bne     400410 R2
22       4004c0   slt     R2 R3 R9
Hardware method
[Figure: MECI graph with nodes 0, 2, 3, 5, 7, 8, 10, 11, 12, 19, 20, branches bgez, beq, bne, and exit1 to exit4.]
Code before generating MECI
inst. #  address  inst.   operands (dest, src1, src2)
0        400410   addu    R13 R0 R0
1        400418   lw      R23 100 R2
2        400420   addiu   R4 R4 2
3        400428   subu    R3 R2 R11
4        400430   bgez    400440 R3
5        400438   addiu   R13 R0 8
6        400440   beq     400468 R13
7        400448   subu    R3 R0 R3
8        400450   addu    R10 R0 R0
9        400458   lw      R8 R9 0x3
10       400460   slt     R2 R3 R9
11       400468   addu    R8 R8 R9
12       400470   ori     R10 R10 1
13       400478   bne     4004a8 R2
14       400480   addiu   R10 R0 4
15       400488   subu    R3 R3 R9
16       400490   addu    R8 R8 R9
17       400498   sra     R9 R9 0x1
18       4004a0   slt     R2 R3 R9
19       4004a8   ori     R10 R10 2
20       4004b0   subu    R3 R3 R9
21       4004b8   bne     400410 R2
22       4004c0   slt     R2 R3 R9
Code after generating MECI
inst. #  address  inst.   operands (dest, src1, src2)
1        400410   lw      R23 100 R2
0        400418   addu    R13 R0 R0
2        400420   addiu   R4 R4 2
3        400428   subu    R3 R2 R11
4        400430   bgez    400440 R3
5        400438   addiu   R13 R0 8
6        400440   beq     400468 R13
7        400448   subu    R3 R0 R3
8        400450   addu    R10 R0 R0
10       400458   slt     R2 R3 R9
9        400460   lw      R8 R9 0x3
11       400468   addu    R8 R8 R9
12       400470   ori     R10 R10 1
13       400478   bne     4004a8 R2
14       400480   addiu   R10 R0 4
15       400488   subu    R3 R3 R9
16       400490   addu    R8 R8 R9
17       400498   sra     R9 R9 0x1
18       4004a0   slt     R2 R3 R9
19       4004a8   ori     R10 R10 2
20       4004b0   subu    R3 R3 R9
21       4004b8   bne     400410 R2
22       4004c0   slt     R2 R3 R9
Sequencer table (CAM)
address  CI Index
400418   3
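The table look-up idea can be sketched as a mapping from fetch address to CI index. This is a minimal software model under stated assumptions: the single entry (400418 to CI index 3) is the one shown on the slide, while the function name `lookup_on_fetch` and the returned tuples are illustrative only and say nothing about the actual hardware interface.

```python
# Sequencer table modelled as a dict: instruction address -> CI index.
# The only entry is the one shown on the slide.
SEQUENCER_TABLE = {0x400418: 3}


def lookup_on_fetch(pc, table=SEQUENCER_TABLE):
    """On a table hit, hand execution to the CRFU with the stored CI index;
    otherwise the base processor executes the instruction as usual."""
    ci_index = table.get(pc)
    if ci_index is not None:
        return ("crfu", ci_index)
    return ("base", None)
```

For example, `lookup_on_fetch(0x400418)` selects the CRFU, while any address missing from the table falls through to the base pipeline.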
Compare Energy Consumption
Energy of the conventional RISC processor (Base):
$$E = N_{CLOCK} \times E_{CYCLE} + \sum_{comp} AC_{COMP} \times E_{COMP} + N_{MISS} \times E_{OFFCHIP}$$
Energy of ADEXOR:
$$E = N_{CLOCK} \times E_{CYCLE} + \sum_{comp} AC_{COMP} \times E_{COMP} + N_{MISS} \times E_{OFFCHIP} + E_{OVERHEAD}$$
The terms are, in order: clock; components ($, RFs, FUs, ...); off-chip accesses; CRFU (no CRFU term for Base).
Energy Overhead for CRFU
Energy overhead terms for software-based and hardware-based invocation:
CRFU execution (both): $\sum_{i=1}^{\#MECIs} N_{MECI(i)} \times E_{CRFU\text{-}MECI(i)}$
Configuration memory (both): $\sum_{i=1}^{\#MECIs} N_{MECI(i)} \times E_{CONFIG\text{-}MECI}$
Register file, for port sharing (both): $N_{RF\text{-}R/W} \times E_{RF\text{-}OVERHEAD}$
Invocation: none for software-based invocation; $N_{IFETCH} \times E_{TABLE\text{-}LOOKUP}$ for hardware-based invocation
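The energy model can be written out as a short calculation. This is a sketch that assumes the listed terms are simply summed; all function and parameter names are illustrative, and any numbers passed in are placeholders rather than measured values.

```python
def base_energy(n_clock, e_cycle, accesses, e_comp, n_miss, e_offchip):
    """Clock energy + per-component access energy + off-chip access energy.
    `accesses` and `e_comp` map a component name to its access count and
    its energy per access."""
    component_energy = sum(accesses[c] * e_comp[c] for c in accesses)
    return n_clock * e_cycle + component_energy + n_miss * e_offchip


def crfu_overhead(n_meci, e_crfu_meci, e_config_meci, n_rf_rw, e_rf_overhead,
                  n_ifetch=0, e_table_lookup=0.0):
    """CRFU overhead. `n_meci[i]` is the execution count of MECI i and
    `e_crfu_meci[i]` its CRFU energy per execution. The table look-up term
    applies to hardware invocation only, so leave `n_ifetch` at 0 for the
    software (mtc1-like) method."""
    execution = sum(n * e for n, e in zip(n_meci, e_crfu_meci))
    config = sum(n * e_config_meci for n in n_meci)
    register_file = n_rf_rw * e_rf_overhead
    invocation = n_ifetch * e_table_lookup
    return execution + config + register_file + invocation


def adexor_energy(n_clock, e_cycle, accesses, e_comp, n_miss, e_offchip, overhead):
    """Same three terms as the base processor, plus the CRFU overhead."""
    return base_energy(n_clock, e_cycle, accesses, e_comp, n_miss, e_offchip) + overhead
```

The counts for ADEXOR (cycles, component accesses, misses) would come from a separate simulation run, so the same formula is evaluated with different inputs for the two processors.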
Energy Consumption
Pros.
Low activity of hardware components: I-Cache, Bpred; Decoder; Register File; Functional Unit
Reduce the energy for off-chip accesses: higher I-Cache hit rates
Cons.
CRFU configuration: accessing the config. memory; setting control signals in the CRFU
Increased complexity: communication between the processor’s data-path and the CRFU
Experimental Setup
Issue: 4-way
L1 instruction cache: 16K, 2 way, 1 cycle latency, miss penalty 20 cycles
L1 data cache: 16K, 4 way, 1 cycle latency, miss penalty 20 cycles
ALUs: 4 integer, 4 floating point
Multiplier: 1 integer, 1 floating point
Divider: 1 integer, 1 floating point
Branch predictor: bimodal
Branch prediction table size: 2048
Extra branch misprediction: 3
Access Reduction
[Figure: bar chart, HW invocation (table look-up). Y-axis: reduction in accesses to components (%). X-axis: benchmark programs and averages. Series: Reg File, I-Cache, Int ALU, Result Bus, Miss of I-Cache, Decoder, Branch Predictor.]
Total Energy Reduction
[Figure: bar chart, HW invocation (table look-up). Y-axis: total energy reduction (%). X-axis: benchmark programs and averages. Series: clock gating, 200 MHz, 250 MHz, 300 MHz, 350 MHz, 400 MHz. Annotation: 42%.]
MECIs vs. CIs
[Figure: bar chart, SW invocation (mtc1-like inst.). Y-axis: total energy reduction (%). X-axis: benchmark programs and averages labeled 1x, 10x, 50x, 100x. Series: MECI-clock-gating, CI-clock-gating, MECI-300MHz, CI-300MHz. Annotation: 15%.]
Conclusions
Adaptive, Extensible Processor
A coarse-grain re-configurable FU
Multi-Exits Custom Instructions
Energy Efficiency
vs. base processor: 42% reduction
vs. BB-based CIs: 15% more energy saving
Future Work
Chip implementation for accurate evaluations
Backup Slides
Tool Chain for generating MECIs
[Figure: tool-chain flow for the base processor. Steps shown:]
Profiler: Simplescalar (PISA configuration)
Detecting start addresses of HBBs
Reading HBBs from object code
Linking HBBs and making a HIS
Generating CDFG
Generating MECIs
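The first step of the flow, detecting the start addresses of HBBs from a profile, might look like the sketch below. HBB is read here as hot basic block, matching the "Hot Basic Block" label on the execution overview slide. The selection rule (share of dynamic instructions above a threshold) and the 5% default are assumptions for illustration; the slides do not state the actual criterion, and the profile counts in the usage note are invented.

```python
def detect_hot_basic_blocks(exec_counts, threshold=0.05):
    """Return the start addresses of basic blocks whose share of the
    dynamic instruction count is at least `threshold`, hottest first.

    exec_counts: {start_address: dynamic instruction count} from a profiler.
    """
    total = sum(exec_counts.values())
    if total == 0:
        return []
    hot = [addr for addr, count in exec_counts.items()
           if count / total >= threshold]
    return sorted(hot, key=lambda addr: exec_counts[addr], reverse=True)
```

With a made-up profile such as `{0x400680: 9000, 0x400410: 800, 0x400900: 200}`, the first two blocks pass the 5% threshold and the third does not.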
Clock Energy Reduction
[Figure: bar chart. Y-axis: clock energy reduction (%). X-axis: benchmark programs and averages. Series: clock gating, 200 MHz, 250 MHz, 300 MHz, 350 MHz, 400 MHz.]
The Effect of Energy Overhead on the Total Energy Reduction
[Figure: chart. Y-axis: total energy reduction (%), 0 to 45. X-axis: OVH_COEF (1x, 10x, 30x, 50x, 100x). Series: clock-gating-seq, clock-gating-mtc1, 300 MHz-seq, 300 MHz-mtc1.]
Synthesis result
Synopsys tools
Hitachi 0.18 μm
Area: 2.1 mm²
Configuration bits: 615 bits
Delay:
Depth of DFG of MECI   Delay (ns)
1                      2.2
2                      4.2
3                      6.1
4                      7.9
5                      9.8
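The delay table can be related to the clock frequencies used in the evaluation (200 to 400 MHz). The sketch below assumes that a MECI occupies the CRFU for a whole number of cycles, namely the synthesized delay rounded up to the next clock period. That rounding rule is an assumption made for illustration and is not stated on the slides.

```python
import math

# Synthesized CRFU delay (ns) by DFG depth of the MECI, from the table above.
DELAY_NS = {1: 2.2, 2: 4.2, 3: 6.1, 4: 7.9, 5: 9.8}


def crfu_cycles(depth, freq_mhz):
    """Cycles needed by a MECI of the given DFG depth at `freq_mhz`,
    assuming the delay is rounded up to whole clock periods."""
    period_ns = 1000.0 / freq_mhz
    return math.ceil(DELAY_NS[depth] / period_ns)
```

Under this assumption a depth-5 MECI takes 2 cycles at 200 MHz and 4 cycles at 400 MHz, which gives one reason the results are reported per clock frequency.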
Configuration Memory
615 configuration bits ~ 80 bytes
100 MECIs
80x100 bytes SRAM with a 640-bit width data bus
CACTI
Energy for each access: 0.198 nJ
Area: 0.77 mm²
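The sizing figures above follow from simple arithmetic, shown here as a quick check. The variable names are illustrative; the inputs (615 bits, 80-byte entries, 100 MECIs) are taken from the slide.

```python
import math

CONFIG_BITS = 615   # configuration bits per MECI (from the slide)
ENTRY_BYTES = 80    # entry size chosen on the slide
NUM_MECIS = 100     # number of MECIs stored (from the slide)

# Smallest whole number of bytes that holds one configuration.
min_entry_bytes = math.ceil(CONFIG_BITS / 8)

# One 80-byte entry read in a single access needs an 80 * 8 bit data bus.
bus_width_bits = ENTRY_BYTES * 8

# Total SRAM capacity for all MECIs.
total_bytes = ENTRY_BYTES * NUM_MECIS
```

So 615 bits need at least 77 bytes, the 80-byte entry matches the 640-bit bus, and the memory holds 8000 bytes in total.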
Sequencer
CACTI
Energy: 0.29 nJ
Area: 0.61 mm²
The sequencer covers more dynamic instructions but has more hardware and energy overhead compared to the mtc1 approach
MECIs vs. CIs
[Figure: bar chart. Y-axis: total energy reduction (%). X-axis: benchmark programs and averages labeled 1x, 10x, 50x, 100x. Series: MECI-clock-gating, CI-clock-gating, MECI-300MHz, CI-300MHz.]
DATE2007
CRFU Architecture: A Quantitative Approach
22 programs of MiBench were chosen
Simplescalar toolset was utilized for simulation
CRFU is a matrix of FUs. Design parameters:
No. of inputs
No. of outputs
No. of FUs
Connections
Location of inputs & outputs
Some definitions:
Considering frequency and weight in measurement
CI execution frequency
Weight (to equal number of executed instructions)
Average = for all CIs (ΣFreq*Weight)
Rejection rate: percentage of MECIs that could not be mapped on the CRFU
Mapping rate: percentage of MECIs that could be mapped on the CRFU
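The frequency-and-weight measurement can be sketched as follows. It assumes the mapping rate is the ΣFreq*Weight of the mappable MECIs divided by the ΣFreq*Weight of all MECIs; the normalization is an assumption, since the slide only gives the ΣFreq*Weight form. The function names and the sample values in the usage note are illustrative.

```python
def mapping_rate(cis):
    """Weighted mapping rate in percent.

    cis: iterable of (frequency, weight, mappable) tuples, where `weight`
    is the number of instructions the CI stands for and `mappable` says
    whether it fits on the CRFU configuration under study.
    """
    cis = list(cis)
    total = sum(freq * weight for freq, weight, _ in cis)
    if total == 0:
        return 0.0
    mapped = sum(freq * weight for freq, weight, ok in cis if ok)
    return 100.0 * mapped / total


def rejection_rate(cis):
    """Complement of the mapping rate, in percent."""
    return 100.0 - mapping_rate(cis)
```

For instance, `mapping_rate([(100, 5, True), (50, 10, False)])` gives 50.0, because both CIs contribute the same Freq*Weight and only one of them maps.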
Inputs/Outputs
[Figure: chart. Y-axis: mapping rate (0 to 100). X-axis: number of inputs/outputs (1 to 12). Series: Inputs, Outputs.]
Functional Units
[Figure: chart. Y-axis: mapping rate (0 to 100). X-axis: number of FUs (1, 3, 5, ..., 23, 31, 38, 58).]
Width/Depth
[Figure: chart. Y-axis: mapping rate (0 to 100). X-axis: number of width and depth (1 to 12). Series: width without constraints, width with constraints, depth without constraints, depth with constraints.]
CRFU Architecture
[Figure: CRFU architecture. Labels: CRFU input ports; connections from input ports to inputs of the rows; Row1 to Row5 of FUs with configuration bits; outputs of 1st row to the inputs of 3rd, 4th and 5th rows; outputs of 2nd row to the inputs of 4th and 5th rows; CRFU output ports. FU detail: adder/subtractor, AND, OR, XOR, barrel shifter.]
Supporting Conditional Execution
[Figure: conditional execution support. Labels: FU1 to FU4 with configuration bits; selector mux and data selection mux, controlled by configuration bits and by the branch results from FU1 and FU2.]
Experiment setup
22 applications of MiBench
Simplescalar
Issue: 4-way
L1 I-cache: 32K, 2 way, 1 cycle latency
L1 D-cache: 32K, 4 way, 1 cycle latency
Unified L2: 1M, 6 cycle latency
Execution units: 4 integer, 4 floating point
RUU size & fetch queue size: 64
Branch predictor: bimodal
Branch prediction table size: 2048
Extra branch misprediction latency: 3
Speedup CIs & MECIs
[Figure: bar chart. Y-axis: speedup (1 to 3). X-axis: benchmark programs and average. Series: CIs, MECIs.]
Effect of clock frequency on speedup
[Figure: bar chart. Y-axis: speedup (1 to 3.5). X-axis: benchmark programs and average. Series: 200 MHz, 250 MHz, 300 MHz, 350 MHz, 400 MHz.]