Embedded Software Architecture for Low Power


NATURE: Non-Volatile Nanotube RAM based Field-Programmable Gate Arrays
Wei Zhang†, Niraj K. Jha† and Li Shang‡
†Dept. of Electrical Engineering, Princeton University
‡Dept. of Electrical and Computer Engineering, Queen’s University
A Hybrid CMOS/NAnoTUbe REconfigurable Architecture
Motivation
Background on CNT and NRAM
Architecture of NATURE
Logic Folding
Experimental Results
Conclusions
Motivation
Moore’s Law: What’s Next?
Carbon nanotubes (CNTs)
Nanowires
Single electron devices
...
Challenges in nano-circuits/architectures
Lack of a mature fabrication process
Defects and run-time failures
Reconfigurable architectures, such as FPGAs, are favored
Regular structures ease fabrication
Fault tolerance through reconfiguration
Motivation (Contd.)
Problems of existing reconfigurable architectures
High reconfiguration time overhead
Low area efficiency
Some recent works on programmable nanofabrics
Molecular logic array (Goldstein et al. [ICCAD 2002])
Nanowire PLA (DeHon et al. [FPGA 2004])
CMOS/nanowire hybrid architecture CMOL (Strukov et al. [Nanotechnology 2005])
Fabrication problem not yet solved
Advantages of NATURE
CMOS fabrication compatible
Hybrid design leverages beneficial aspects of both CMOS and CNT technologies
NRAM-based run-time reconfiguration
NRAMs are distributed in NATURE to store multi-context reconfiguration bits
Fine-grain reconfiguration (even cycle-by-cycle)
Temporal logic folding
Enabled by fine-grain run-time reconfiguration
Design flexibility
Flexibility to perform area-performance trade-offs
Logic density
One-to-two orders of magnitude increase in logic density
Background
Carbon nanotube (CNT)
Metallic or semiconducting
Single-wall or multi-wall
Diameter: 1-100nm
Length: up to millimeters
Ballistic transport
Excellent thermal conductivity
Very high current density
High chemical stability
Robust to environment
Source: Euronanotrade
Background (Contd.)
Source: Nantero
Non-volatile nanotube random-access memory (NRAM)
Whether a nanotube is mechanically bent or not determines its bistable on/off state
Fully CMOS-compatible manufacturing process
Prototype chip: 10 Gbit NRAM
Will be ready for the market in the near future
NRAMs
Properties of NRAMs
Non-volatile
Similar speed to SRAM
Similar density to DRAM
Chemically and mechanically stable
NATURE is not tied to NRAMs; other non-volatile memories could be used:
Phase change RAM
Magnetoresistive RAM
Ferroelectric RAM
Architecture of NATURE
Island-style logic blocks (LBs) connected by various levels of interconnects
An LB contains a super macroblock (SMB) and a local switch matrix
Switch matrix: local routing network
S1: switch box between length-1 wires
S2: switch box between length-4 wires
Figure: array of LBs with connection blocks, switch blocks, length-1 wires, length-4 wires, long wires and direct links
Architecture of a Super Macroblock (SMB)
n1 macroblocks (MBs) comprise an SMB, here n1 = 4
Figure: SMB block diagram with four MBs, each paired with an NRAM and a 48 to 16 crossbar configured by SRAM bits
Figure labels: inputs from the switch matrix, CLK and global signals, reconfiguration bits, 32 outputs of SMB
Architecture of a Macroblock (MB)
n2 logic elements (LEs) comprise an MB, here n2 = 4
Figure: MB block diagram with four LEs, each paired with an NRAM and a 12 to 4 crossbar configured by 48 SRAM bits
Figure labels: inputs to MB, CLK and global signals, reconfiguration bits, 8 outputs of MB
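The block sizes quoted on the SMB and MB slides can be cross-checked with a few lines of arithmetic. The sketch below assumes a full crossbar that needs one SRAM bit per crosspoint; that assumption is not stated on the slides, although it does reproduce the quoted 48 SRAM bits for a 12 to 4 crossbar. The function and constant names are made up for illustration.

```python
# Hierarchy parameters quoted on the slides.
N1_MBS_PER_SMB = 4    # n1: macroblocks per super macroblock
N2_LES_PER_MB = 4     # n2: logic elements per macroblock
OUTPUTS_PER_MB = 8    # "8 Outputs of MB"


def crossbar_config_bits(n_inputs: int, n_outputs: int) -> int:
    """SRAM bits for a full crossbar, ASSUMING one bit per crosspoint."""
    return n_inputs * n_outputs


les_per_smb = N1_MBS_PER_SMB * N2_LES_PER_MB
smb_outputs = N1_MBS_PER_SMB * OUTPUTS_PER_MB
mb_crossbar_bits = crossbar_config_bits(12, 4)     # 12 to 4 crossbar inside an MB
smb_crossbar_bits = crossbar_config_bits(48, 16)   # 48 to 16 crossbar inside an SMB

print("LEs per SMB:", les_per_smb)
print("SMB outputs:", smb_outputs)
print("SRAM bits per 12 to 4 crossbar:", mb_crossbar_bits)
print("SRAM bits per 48 to 16 crossbar (under the same assumption):", smb_crossbar_bits)
```

Under these assumptions the 4 x 8 MB outputs account for the 32 outputs of the SMB shown on the previous slide.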
Logic Element and Interconnect
An LE implements a computation and contains:
SRAM cells
An m-input look-up table (LUT)
A flip-flop (DFF)
A pass transistor
Interconnect
Mixed wire segment scheme: 25%, 50% and 25% distribution for length-1, length-4 and long wires
Length-1: 64 tracks
Length-4: 128 tracks
Long wire: 64 tracks
Direct links from one LB to its 4 neighbors
Direct link: 128 tracks
Figure: LE schematic (m-input LUT, DFF, CLK, NRAM) and the interconnect around the MBs of an SMB
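As a quick consistency check, the track counts quoted for the segmented wires can be turned into shares of the total. This is only arithmetic on the slide's numbers; the dictionary name is an illustrative choice, and direct links are left out because the slide lists them separately from the mixed wire segment scheme.

```python
# Track counts per wire type, as quoted on the slide.
tracks = {"length-1": 64, "length-4": 128, "long": 64}

total = sum(tracks.values())
shares = {name: count / total for name, count in tracks.items()}

for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```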
Support for Reconfiguration
NRAM structure
Figure labels: bit line decoder, word line decoder, read voltage, electrode, pulldown resistor, SRAM cell
Reconfiguration time is short: 160ps
Area overhead of NRAMs
k: number of reconfiguration sets per NRAM, assume k = 16
Area overhead: 20.5% per LB, assuming 100nm technology for CMOS logic and nanotube length
Logic density = k (configuration copies) x area per configuration = 16 x (1 - 0.205) ≈ 12.7
An appropriate value for k is obtained through design space exploration
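The logic density expression above is easy to evaluate for other parameter values. The sketch below implements only the slide's formula; the function name is hypothetical, and the way the area overhead itself would change with k is not given on the slide, so it is left as a plain input.

```python
def relative_logic_density(k: int, nram_area_overhead: float) -> float:
    """Slide formula: k configuration copies, each using (1 - overhead) of the LB area."""
    return k * (1.0 - nram_area_overhead)


# Values quoted on the slide: k = 16 and a 20.5% NRAM area overhead per LB.
density = relative_logic_density(k=16, nram_area_overhead=0.205)
print(f"relative logic density: {density:.2f}")
```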
Temporal Logic Folding
Basic idea: one can use NRAM-enabled run-time
reconfiguration to realize different Boolean functions in
the same logic element (LE) every few cycles
Figure: a three-LUT circuit with inputs a to h and output OUT, executed on a single LE whose configurations are loaded from the NRAM
Cycle 1: LUT1 computes i = abc’
Cycle 2: LUT2 computes l = (i’+e’+f’)h’
Cycle 3: LUT3 computes OUT = d’g’+l
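The three-cycle example can be mimicked in software to see that one reconfigurable LUT reproduces the original three-LUT circuit. This is a minimal behavioral sketch, not the hardware mechanism: the list `NRAM_CONFIGS` stands in for the configuration sets stored in the NRAM, and all function names are made up for illustration. Only the three Boolean functions come from the slide.

```python
from itertools import product


def truth_table(fn, n_inputs):
    """LUT contents: fn evaluated over all input combinations (first input is the MSB)."""
    return [int(bool(fn(*bits))) for bits in product((0, 1), repeat=n_inputs)]


def lut_read(table, bits):
    """Look up one output bit, using the input bits as the table address."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return table[index]


# One configuration per cycle: (output signal, input signals, LUT contents).
NRAM_CONFIGS = [
    ("i", ("a", "b", "c"), truth_table(lambda a, b, c: a and b and not c, 3)),
    ("l", ("i", "e", "f", "h"),
     truth_table(lambda i, e, f, h: (not i or not e or not f) and not h, 4)),
    ("OUT", ("d", "g", "l"), truth_table(lambda d, g, l: (not d and not g) or l, 3)),
]


def run_folded(inputs):
    """Execute the circuit on a single LE, loading a new configuration each cycle."""
    signals = dict(inputs)  # intermediate results play the role of the flip-flop state
    for out, ins, table in NRAM_CONFIGS:
        signals[out] = lut_read(table, [signals[name] for name in ins])
    return signals["OUT"]


def run_flat(a, b, c, d, e, f, g, h):
    """Reference: the same three functions evaluated directly."""
    i = a and b and not c
    l = (not i or not e or not f) and not h
    return int(bool((not d and not g) or l))


mismatches = [
    bits for bits in product((0, 1), repeat=8)
    if run_folded(dict(zip("abcdefgh", bits))) != run_flat(*bits)
]
print("mismatching input vectors:", len(mismatches))
```

The widest configuration here uses four inputs, so it fits the 4-input LUT used later in the experimental setup.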
Example
Figure: the same circuit (inputs x0 to x3 and y0 to y3, intermediate signals a0, b0 and c0, output Out) mapped without and with logic folding
Without logic folding
Num of LEs = 6
Delay = 4 LE delays + interconnect delay
With logic folding (LE1 and LE2 are reused, with a reconfiguration between cycles)
Num of LEs = 2
Delay = 4 * clock_period
Clock period = LE delay + reconfiguration time + interconnect delay
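The two delay expressions from this example can be compared numerically. In the sketch below only the 160 ps reconfiguration time and the LE counts come from the slides; the LE delay and both interconnect delays are hypothetical placeholders chosen for illustration, and the function names are invented.

```python
RECONFIG_PS = 160  # reconfiguration time quoted on the slides


def delay_without_folding(le_ps, interconnect_ps, le_levels=4):
    """Delay = 4 LE delays + interconnect delay."""
    return le_levels * le_ps + interconnect_ps


def delay_with_folding(le_ps, interconnect_ps, cycles=4):
    """Delay = 4 * clock period; clock period = LE delay + reconfiguration + interconnect."""
    clock_ps = le_ps + RECONFIG_PS + interconnect_ps
    return cycles * clock_ps


# HYPOTHETICAL timing values in picoseconds, not taken from the slides.
LE_PS = 200
GLOBAL_INTERCONNECT_PS = 600   # total routing delay of the unfolded mapping
LOCAL_INTERCONNECT_PS = 100    # per-cycle routing delay of the folded mapping

flat_ps = delay_without_folding(LE_PS, GLOBAL_INTERCONNECT_PS)
folded_ps = delay_with_folding(LE_PS, LOCAL_INTERCONNECT_PS)

# Area-time product, using the number of LEs as the area measure.
flat_at = 6 * flat_ps
folded_at = 2 * folded_ps

print("delay (ps):", flat_ps, "unfolded vs", folded_ps, "folded")
print("#LEs * delay:", flat_at, "unfolded vs", folded_at, "folded")
```

With these made-up numbers the folded mapping is somewhat slower but uses a third of the LEs, so its area-time product comes out lower; real values would depend on the technology and the routing.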
Folding Levels
Logic folding can be performed at different levels of granularity,
providing flexibility to perform area-performance trade-offs
A level-p folding implies reconfiguration of the LE after the
execution of p LUT computations
Figure: the same LUT network mapped onto macroblocks, with a reconfiguration between successive folding stages
(a) level-1 folding
(b) level-2 folding
Choosing the Folding Level
As the folding level increases:
Clock period increases
Routing delay increases
Number of clock cycles decreases
Reconfiguration time decreases
Number of LEs increases
Total delay typically decreases
Area increases
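The trends listed above can be reproduced with a toy cost model. The sketch assumes a circuit with a uniform number of LUTs per level, a clock period that grows linearly with the folding level p, and one reconfiguration per clock cycle; none of these modeling choices come from the slides, and every timing value except the 160 ps reconfiguration time is hypothetical.

```python
from math import ceil

RECONFIG_PS = 160  # reconfiguration time quoted on the slides


def fold(depth, width, p, le_ps=200, route_ps=100):
    """Toy cost model for level-p folding (default delays are HYPOTHETICAL)."""
    cycles = ceil(depth / p)                         # one folding stage per clock cycle
    clock_ps = p * (le_ps + route_ps) + RECONFIG_PS  # p LUT levels, then reconfigure
    return {
        "les": p * width,                            # LEs needed to hold p LUT levels
        "cycles": cycles,
        "clock_ps": clock_ps,
        "reconfig_total_ps": cycles * RECONFIG_PS,
        "delay_ps": cycles * clock_ps,
    }


# Hypothetical circuit: 8 LUT levels deep, 4 LUTs per level.
results = {p: fold(depth=8, width=4, p=p) for p in (1, 2, 4)}
for p, r in results.items():
    print(f"level-{p}:", r)
```

In this model a higher folding level trades more LEs for fewer cycles and less total reconfiguration time, which matches the direction of each trend on the slide.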
Advantages of logic folding
Significant flexibility for performing area-performance
trade-offs
Ability to map much larger circuits using the same
number of LEs
Significant improvement in the area/circuit delay
product
Reduction in the need for global routing
Experimental Setup
Instance of architecture: 4 MBs in an SMB, 4 LEs in
an MB, and LEs contain a 4-input LUT
Number of reconfiguration copies k varied in order to
compare implementations corresponding to selected
folding levels: level-1, level-2, level-4 and no logic
folding
Results based on 100nm CMOS technology
parameters
Experimental Results
Figure: two bar charts comparing level-1, level-2, level-4 and no folding on the benchmarks alu2, 9symml, ldd, lal, cordic, poler8, cc, z4ml, cm163a, sct and pm1
Chart: #LEs * Delay for different folding levels (normalized to level-1)
Chart: Delay (ns) for different folding levels (normalized to level-1)
Average area-time product advantage = 2X
Maximum area-time product advantage = 3X
Experimental Results (Contd.)
16-RCA: 16-bit ripple carry adder
16-CLA: 16-bit carry lookahead adder
16-CSA: 16-bit carry select adder
8-MUL: 8-bit multiplier
Figure: two bar charts comparing level-1, level-2, level-4 and no folding on the benchmarks 16-RCA, 32-RCA, 64-RCA, 16-CLA, 32-CLA, 64-CLA, 16-CSA, 32-CSA, 64-CSA, 8-MUL, 16-MUL and 32-MUL
Chart: #LEs * Delay for different folding levels (normalized to level-1)
Chart: Delay (ns) for different folding levels (normalized to level-1)
Average area-time product advantage = 13X
Maximum area-time product advantage = 35X
Experimental Results (Contd.)
Flexibility in performing area-performance trade-off
For the area-time (AT) product, the larger the circuit depth, the greater the advantage of level-1 folding relative to no folding
For the 64-bit ripple-carry adder, this advantage is
about 35X
LE utilization and logic density very high, with a
reduced need for a deep interconnect hierarchy
Conclusions
NATURE: A novel high-performance run-time
reconfigurable architecture
Introduction of NRAMs into the architecture enables
cycle-by-cycle reconfiguration and logic folding
Choice of different folding levels allows the flexibility
of performing area-performance trade-offs
Logic density and area-time product improved
significantly
Can be very useful for cost-conscious embedded
systems and future FPGA improvement