Variable Word Width Computation for Low Power

By Bret Victor and Sayf Alalusi
Motivation
• A 32 bit architecture is required for most general purpose computing.
• However, many applications don’t need a full 32 bit data word:
– Video: 24 bit
– Audio: 16 bit
– Text: 8 bit
– Logic: 1 bit
• How can we exploit this to save power?
Possibilities
• Architecture that supports 32, 24, 16, 8, and 1 bit
operations? Or some subset?
• Switch processor between modes, or specify width
for each instruction? Global or distributed
control?
• Gated clocks? Don’t drive unused outputs?
Power down unused blocks?
Implementation
• Based on MIPS architecture and ISA
• Two widths: 16 bit and 32 bit
• Width chosen on an instruction-by-instruction basis
• Flag bit in instruction word selects width
• Modified ISA:
– arithmetic: add16, add32; mul16, mul32
– logical: and16, and32
– memory: lw16, lw32; sw16, sw32
– branch compare: beq16, beq32
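As a rough illustration of the flag-bit idea, the sketch below decodes an operand width from a 32 bit instruction word and applies it to an add. The slides do not say which bit carries the flag, so WIDTH_FLAG_BIT and both helper names are hypothetical.

```python
# Hypothetical position of the width flag; the slides do not specify it.
WIDTH_FLAG_BIT = 0

def operand_width(instruction: int) -> int:
    """Return 32 if the (assumed) width flag bit is set, otherwise 16."""
    return 32 if (instruction >> WIDTH_FLAG_BIT) & 1 else 16

def add(a: int, b: int, instruction: int) -> int:
    """Model add16/add32: the sum is truncated to the selected width."""
    mask = (1 << operand_width(instruction)) - 1
    return (a + b) & mask
```

With this encoding, an instruction word with the flag clear behaves like add16 and one with the flag set behaves like add32.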
Energy
• Energy consumption occurs when a node
transitions, and is proportional to the capacitance
at that node.
• Prevent nodes from transitioning unnecessarily.
• Energy savings can be calculated by adding all the
capacitance that is switching.
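A minimal sketch of this accounting, assuming every bit line of a bus has the same capacitance and that energy is spent only when a line toggles. The function names and the cap_per_bit parameter are illustrative, and the result is in arbitrary units.

```python
def toggled_bits(old: int, new: int) -> int:
    """Number of bit positions that differ between two bus values."""
    return bin(old ^ new).count("1")

def switching_energy(values, cap_per_bit=1.0):
    """Relative energy of driving a sequence of values onto a bus.

    Assumes equal capacitance on every bit line (cap_per_bit, arbitrary
    units) and counts energy only for lines that change state.
    """
    energy = 0.0
    for old, new in zip(values, values[1:]):
        energy += cap_per_bit * toggled_bits(old, new)
    return energy
```

Under this model a bus that keeps its high 16 bits constant never pays for them, which is the effect the rest of the design tries to exploit.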
Where We Save Energy
• Our design saves energy over a traditional
processor in three main areas:
– Clock and control line energy
– HWTE (High Word Transition Energy)
– Memory control energy
• We will see these three areas as we step through
the pipeline.
Pipeline Overview
[Figure: five-stage pipeline datapath (IF, ID, EX, MEM, WB) separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. Labeled elements: PC, +4 adder, I$, register file (srcA, srcB, outA, outB, dest, data), branch offset adder, = comparator, forwarding MUXes (fwd from MEM, fwd from WB), ALU, data memory (addr, rd data, wr data), and write-back MUX. Data and address paths are 32 bits, register specifiers 5 bits, immediates 16 bits.]
IF Stage
[Figure: IF stage. A MUX selects between the 32 bit branch address and PC + 4; the PC addresses the I$, whose 32 bit output goes to the IF/ID register.]
• Instruction words and addresses must be 32 bits.
• Can’t modify much.
ID Stage
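[Figure: ID stage, between the IF/ID and ID/EX registers. Labeled elements: register file (srcA, srcB, dest: 5 bits; outA, outB, data: 32 bits), 16 bit immediate, branch offset adder taking PC + 4 and producing the branch address, and = comparator.]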
• We can:
– gate the clocks of the pipeline register
– only drive high words out of register file if 32 bit
operation
Pipeline Register (ID)
[Figure: ID/EX pipeline register with fields reg A: high 16, reg A: low 16, reg B: high 16, reg B: low 16, destReg: 5, and immed: 16. Three clocks are derived from Clock: UngatedClock, WidthGatedClock (gated by Width, from the instruction word), and ImmedGatedClock.]
• Fit gating into clock distribution network.
• Little energy overhead and helps control skew.
• On ID stage, gating reduces clock energy by:
– 56% on 16-bit operations
– 19% on 32-bit non-immediate operations
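The two percentages can be checked with a simple flip-flop count. The sketch below assumes clock energy is proportional to the number of register bits that receive a clock edge, and that both cases are non-immediate operations (so the immed field is never clocked). Under those assumptions it gives the same figures as the slide.

```python
# Bits in the ID/EX pipeline register, using the field widths from the slide.
FIELDS = {"regA_hi": 16, "regA_lo": 16, "regB_hi": 16, "regB_lo": 16,
          "destReg": 5, "immed": 16}

def clock_reduction(clocked_fields):
    """Percent of clock energy saved, assuming energy ~ number of clocked bits."""
    total = sum(FIELDS.values())
    clocked = sum(FIELDS[name] for name in clocked_fields)
    return round(100 * (total - clocked) / total)

# 16 bit operation: only the low words and the destination register are clocked.
op16 = clock_reduction(["regA_lo", "regB_lo", "destReg"])
# 32 bit non-immediate operation: everything except the immediate is clocked.
op32 = clock_reduction(["regA_hi", "regA_lo", "regB_hi", "regB_lo", "destReg"])
print(op16, op32)  # 56 19
```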
Register File Read Port (ID)
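[Figure: register file read port. A decoder selects one of N registers (Reg 0, Reg 1, ...); each register drives its low 16 and high 16 bits onto separate 16 bit halves of the output bus, with Width qualifying the high-word drivers.]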
• Decoder selects register to drive output bus.
• We add one AND gate per register.
• Switching capacitance dominated by output bus.
• 16 bit operation takes 50% less energy than 32 bit operation...
• Not necessarily savings!
EX Stage
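[Figure: EX stage. Two forwarding MUXes select the ALU inputs: one from reg A, fwd from MEM, and fwd from WB; the other from reg B, fwd from MEM, fwd from WB, and immed. The 32 bit ALU result, the 32 bit data for SW, and the 5 bit dest reg pass on to the next pipeline register.]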
• Modify the ALU to perform 16 bit operations.
• Prevent the high word output of the MUXes from changing
on 16 bit operations.
• Gate the clock of the pipeline register:
– Only latch high word of ALU result on 32 bit operations
– Only latch reg B on “store word” operations
Logical Inst.’s (EX)
[Figure: bitwise logic unit, e.g. X AND Y, with one independent gate per bit pair: (X0, Y0), (X1, Y1), ..., (X31, Y31).]
• Just don’t let the unused bits (high 16) transition
• If they don’t transition, they will not drive the next
stage either.
• 50% less energy
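One way to picture "don't let the unused bits transition" is a unit that keeps the previous high-word inputs on a 16 bit operation. The class below is a behavioral sketch only; the name GatedAndUnit and the input-holding mechanism are assumptions, not details from the slides.

```python
LOW16 = 0xFFFF

class GatedAndUnit:
    """32 bit AND whose high-word inputs are held on 16 bit operations."""

    def __init__(self):
        self.x = 0  # values currently presented at the gate inputs
        self.y = 0

    def execute(self, x: int, y: int, width: int) -> int:
        if width == 16:
            # Keep the previous high words so the upper 16 gates see no change.
            x = (self.x & ~LOW16) | (x & LOW16)
            y = (self.y & ~LOW16) | (y & LOW16)
        self.x, self.y = x, y
        return x & y
```

After a 32 bit operation followed by a 16 bit one, the high word of the output is identical in both results, so nothing downstream of those bits switches.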
Adder (EX)
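[Figure: 32 bit carry-lookahead adder. Eight 4 bit CLA blocks cover bits 0..3, 4..7, 8..11, 12..15, 16..19, 20..23, 24..27, and 28..31, with inputs A0 B0 … An Bn and outputs S0 … Sn, all connected to an upper level CLA generation block.]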
• The 4 bit CLA blocks are simply replicated as the number of bits grows,
but the upper level CLA structure itself grows with the
number of bits.
• 16 bits: 58% less energy
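The block structure can be sketched functionally as below: each 4 bit block produces a sum plus group generate and propagate signals, and an upper level combines those into the carry for the next block. This is a software model under simplifying assumptions. It evaluates carries one after another rather than in parallel, it chains the upper level instead of building the lookahead tree in the figure, and it says nothing about energy, so the 58% figure is not derived here.

```python
def cla4(a: int, b: int, carry_in: int):
    """One 4 bit CLA block: returns (sum, group generate, group propagate)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # bit propagate
    carries = [carry_in]
    for i in range(3):
        carries.append(g[i] | (p[i] & carries[i]))
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    group_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    group_p = p[0] & p[1] & p[2] & p[3]
    return total, group_g, group_p

def cla_add(a: int, b: int, width: int) -> int:
    """Add two width-bit values (width a multiple of 4), discarding carry out."""
    result, carry = 0, 0
    for block in range(width // 4):          # 4 blocks for 16 bits, 8 for 32
        shift = 4 * block
        total, group_g, group_p = cla4((a >> shift) & 0xF,
                                       (b >> shift) & 0xF, carry)
        result |= total << shift
        carry = group_g | (group_p & carry)  # upper level carry into next block
    return result
```

A 16 bit add exercises four blocks and a 32 bit add exercises eight, while the logic that combines the group signals has more to do as the width grows.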
Multiplier (EX)
32 bit multiply: 32 x 32 bit adds, 32 x 32 bit reg. writes, and 32 shifts in 32 cycles
vs.
16 bit multiply: 16 x 16 bit adds, 16 x 16 bit reg. writes, and 16 shifts in 16 cycles
• Multiply complexity grows as N^2, so a 16 bit multiply takes
77% less energy.
• Even if the upper 16 bits = 0, a 32 bit multiply does 16 extra
shifts.
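The N^2 growth can be seen in a small model of an iterative shift-and-add multiplier. The function below is a sketch with an illustrative cost measure: it charges one unit per adder bit per cycle as a crude stand-in for switched capacitance. That count alone gives a 75% reduction; the 77% quoted on the slide comes from the authors' own estimate and is not reproduced by this simple model.

```python
def shift_add_multiply(a: int, b: int, width: int):
    """Iterative shift-and-add multiply of two unsigned width-bit values.

    Returns (product, bit_adds), where bit_adds counts one unit per adder
    bit per cycle as a rough proxy for switched capacitance.
    """
    product, bit_adds = 0, 0
    for cycle in range(width):      # one add, one write, one shift per cycle
        if (b >> cycle) & 1:
            product += a << cycle
        bit_adds += width           # the width-bit adder is exercised each cycle
    return product, bit_adds

_, work32 = shift_add_multiply(1234, 5678, 32)
_, work16 = shift_add_multiply(1234, 5678, 16)
print(1 - work16 / work32)  # 0.75
```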
HWTE
• Two types of data in 16 bit application:
– Computational data (16-bit): high word = 0
– Pointers and addresses (32-bit): high word = C
• Assume C “mostly constant” (memory accesses
mostly in 64K block)
• Traditional processor only consumes more
datapath energy than our processor when
transitioning between these data types.
• HWTE = High Word Transition Energy
HWTE
• With such a model, our processor effectively only
executes “16 bit operations”.
• Traditional processor executes “32 bit operations”
only when transitioning between data types.
• E32 = energy of 32 bit operation
• E16 = energy of 16 bit operation
• N = average number of consecutive instructions
that use the same data type
• HWTE = ( E32 - E16 ) / N
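The formula can be exercised on a toy instruction trace. In the sketch below, the trace encoding ('d' for computational data, 'p' for pointers) and the energy values are made up for illustration; only the HWTE expression itself comes from the slide.

```python
def average_run_length(data_types):
    """Average number of consecutive instructions that use the same data type."""
    if not data_types:
        raise ValueError("need at least one instruction")
    runs = 1
    for prev, cur in zip(data_types, data_types[1:]):
        if cur != prev:
            runs += 1
    return len(data_types) / runs

def hwte(e32: float, e16: float, n: float) -> float:
    """HWTE = (E32 - E16) / N, as defined on the slide."""
    return (e32 - e16) / n

# 'd' = computational data (high word 0), 'p' = pointer (high word C).
n = average_run_length("ddddppddddpp")   # four runs over 12 instructions
print(n, hwte(e32=1.0, e16=0.5, n=n))
```

The longer the runs of same-type instructions, the larger N becomes and the smaller the per-instruction penalty a traditional processor pays for high-word transitions.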
Barrel Shifter (EX)
[Figure: 4 bit barrel shifter slice with inputs A0..A3, outputs B0..B3, and shift control lines SH0..SH3.]
• Big win will come from not driving the control lines to the
upper 16 bits.
• Save about 50% in energy
MEM Stage
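[Figure: MEM stage. The 32 bit ALU result drives the data memory addr port; 32 bit wr data goes in and 32 bit rd data comes out; the ALU result and the 5 bit dest reg are also passed along to the next pipeline register.]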
• This is a big, regular memory (SRAM) structure that can
easily be segmented into blocks.
– Exploit this fact
DCache (MEM)
[Figure: data cache array in which Width and Block # select the word line to drive.]
Only drive the word line that you need!
• 2-way set associative, write-back
• Blocks are 2 x 32b or 4 x 16b, i.e. 16b data values are
aligned on 16b boundaries and 32b values on 32b boundaries.
DCache (MEM)
• Only drive the word lines that are needed.
– Need a little bit of logic to figure out what the correct
lines are, but large capacitance of WL dominates.
• Block size is larger for 16 bit values, better
exploits spatial locality
• Associativity does not change from 16 bit to 32 bit
word lengths
• Energy savings: 50%
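The word-line selection logic might look like the sketch below. It assumes byte addressing and one independently driven word line per 16 bit segment of a block; both are illustrative assumptions rather than details from the slides, and the function name is hypothetical.

```python
BLOCK_BYTES = 8  # one block holds 2 x 32b or 4 x 16b values

def segments_to_drive(address: int, width: int):
    """Return the 16 bit segments (0..3) of a block whose word lines are driven.

    Assumes byte addressing and one word line per 16 bit segment.
    """
    if width not in (16, 32):
        raise ValueError("width must be 16 or 32")
    nbytes = width // 8
    if address % nbytes:
        raise ValueError("access is not aligned to its width")
    first = (address % BLOCK_BYTES) // 2
    return list(range(first, first + nbytes // 2))
```

A 16 bit access drives one segment where a 32 bit access drives two, which is where the 50% figure comes from under this model.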
– Control Line Savings, no HWTE!
WB Stage
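[Figure: WB stage. After the MEM/WB register, a MUX selects between the 32 bit Mem data and the 32 bit ALU result and drives the register file data port; the 5 bit Dest reg drives the dest port. The register file also shows srcA, srcB, outA, and outB.]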
• On a 16 bit operation, we can:
– Only drive the low word out of the MUX
• Capacitive load on register write port is large
• Driving 16 bits out of the MUX consumes 50% less energy
than driving 32 bits… HWTE formula applies.
– Only latch the low word into the register?
Reg. File Write Port (WB)
[Figure: register file write port. A decoder produces a Write signal per register; Width combines with Write to form HiWrite. Write clocks the low 16 bits of each register (Reg 0, Reg 1, ...) and HiWrite clocks the high 16 bits.]
• We can add one AND gate for each register.
• But 16 bit write uses same amount of clock energy as 32 bit write without modifications.
• Little savings from not writing into the register, because the high word would not change in a 16 bit application.
• Not worth it!
Summary
• Typical power distribution in core (non-memory):
– ALU:           34% x 66%
– I-decode:      23% x 100%
– Register file: 13% x 66%
– Clock:         10% x 50%
– Shifter:       11% x 50%
– Pipeline:       9% x 74%
• Core energy reduced by 29%.
Summary
• Typical power distribution in memory:
– Instruction cache: 60% x 100%
– Data cache:        40% x 50%
• Cache energy reduced by 20%.
• Total processor power consumption:
– Cache: 66% x 80%
– Core:  33% x 71%
• Total energy reduced by 24% when executing a 16
bit application.
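The summary figures follow from a weighted sum: each unit's share of power is multiplied by the fraction of its energy that remains. The sketch below pairs the shares and multipliers in the order they appear on the two Summary slides; the helper name is illustrative. Note that the cache and core shares are given as 66% and 33%, so they sum to 99%.

```python
def remaining(parts):
    """Weighted sum of (share of power, fraction of energy remaining) pairs."""
    return sum(share * kept for share, kept in parts)

core = remaining([(0.34, 0.66), (0.23, 1.00), (0.13, 0.66),
                  (0.10, 0.50), (0.11, 0.50), (0.09, 0.74)])
cache = remaining([(0.60, 1.00), (0.40, 0.50)])
total = remaining([(0.66, cache), (0.33, core)])

for name, kept in (("core", core), ("cache", cache), ("total", total)):
    print(f"{name}: reduced by {100 * (1 - kept):.0f}%")
# core: reduced by 29%
# cache: reduced by 20%
# total: reduced by 24%
```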
Conclusions
• Primary drawback is modification of ISA.
• Energy savings are reasonable.
• Our modifications are fairly easy to implement,
and can be fit into existing processor designs with
minimal area increase.
Where do we go from here?
• More accurate capacitance models and SPICE
simulation
• More accurate models of instruction mix