A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power

Frank Vahid* and Ann Gordon-Ross
Dept. of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported by the National Science Foundation and NEC
International Symposium on Low Power Electronics and Design, 2001
Introduction
• Mass-produced microprocessor IC’s prevail in embedded systems
– Cheap
• From amortization and high yields
– Small and low power
• From optimization and use of new technologies
– Available immediately
Sample: annual production 10 million units; cost per unit $2
• Typically run one program forever
[Figure: microcontroller with Processor, Dmem., Pmem., and Periph. blocks]
• QUESTION:
– Can we “tune” a mass-produced microprocessor to its one program to reduce power?
Frank Vahid, www.cs.ucr.edu/~vahid
Introduction
• Use configurable (tunable) components and add a tuner circuit
[Figure: microcontroller with Processor, Dmem., Pmem., Periph., and an added Tuner block]
• Make use of abundant transistors
– Previously, silicon too scarce
– Today, “transistor budgets have gone ballistic” [Microprocessor Report, 1998]
– Software analogy
• Previously, program memory was scarce
• Today, we find a flight simulator hidden in Excel’97
[Figure: Moore’s Law, 2x / 18 months, plotted from 1981 to 2002. Leading edge chip in 1981: 10,000 transistors. Leading edge chip in 2002: 150,000,000 transistors.]
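As a rough sanity check of the Moore's Law figures, the sketch below projects the 1981 transistor count forward to 2002. It assumes exactly one doubling every 18 months. The function name and parameters are illustrative and do not come from the slides.

```python
def projected_transistors(start_count, start_year, end_year, months_per_doubling=18):
    """Project a transistor count forward, assuming a fixed doubling period."""
    doublings = (end_year - start_year) * 12 / months_per_doubling
    return start_count * 2 ** doublings


# 1981 -> 2002 is 21 years, i.e. 14 doublings at 18 months each.
estimate = projected_transistors(10_000, 1981, 2002)
print(f"{estimate:,.0f} transistors")
```

Under that assumption the projection comes out at roughly 164 million, the same order of magnitude as the 150,000,000 quoted for the 2002 leading edge chip.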
Introduction
• We introduce:
– Architecture and methodology for a self-optimizing microprocessor that can tune itself to its program
• Uses self-profiling circuitry and designer-activated self-optimization mode
• To illustrate, we introduce:
– A tunable component: Loop Table
• Similar to loop caches, but differs in how and when contents are updated
– Other tunable components are possible
Problem Description
• Goal:
– Develop a mass-producible standard embedded microprocessor that can tune its configurable components to one application for low power
• Constraints
1. Exact instruction set compatibility
2. Avoid changing tool chain
3. Preserve cycle-by-cycle behavior
– These constraints are more stringent than in most previous work
Related Work
• Application-specific instruction-set processors
– Introduce new instructions for frequent code
• Pre-fabrication: [Fischer99], [Tensillica00]
• Post-fab: [Kucukcakar99] – for mass-produced IC’s
• Modifies instruction-set and tool chain
• Code morphing
– Crusoe: Cache frequent code’s translation
• Helps only if performing dynamic binary translation
• Changes cycle-by-cycle behavior
• Code compression
– Compress frequent code [Ishihara00]
• Modifies tool chain
Related Work
• Cache frequent small loops
– Reduces memory/bus power
– Filter cache [Kin97]
• Small L0 cache
• Many misses (extra cycles)
– Compiler-assisted loop cache [Bellas99]
• Profiler/compiler marks frequent loops for filter cache placement
• Modifies tool chain
– Transparent loop cache [Lee99]
• Fill loop cache only when a short backwards branch is detected
• No tag comparisons – greater efficiency
– Our approach
• Moves profiler to chip, and can be more selective in filling loop cache
[Figure: PID controller example: most execution time spent in two small loops. Bar chart of share of execution time (0% to 60%) per address range.]
[Figure: Pmem and Proc. connected directly, versus Pmem and Proc. with a Loop table]
Architecture Overview
• Standard microcontroller
– ROM access consumes much power
– Added
• Self-Profiling Controller and Loop Count Table for profiling
• Loop Table to store common loops
• Bypass Controller to switch to Loop Table
[Figure: power in milliwatts (0 to 25) for Ex1, Ex2, and Ex3 before optimization (bef), broken down into ROM, RAM, ALU, and Control]
[Figure: microprocessor block diagram with ROM, Configuration Memory (~10’s of bytes), RAM, Datapath, Controller, Loop Table, Bypass Controller, Self-Profiling Controller, and Loop Count Table]
Methodology Overview
[Figure: timeline with stages (Designer: pre-fabrication), Designer: post-fabrication, and User; self-optimization mode activation is marked on it]
• Self-optimizing microcontroller
– Post-fabrication (hence mass-produced)
– In-system
– Tuning under designer control
• Not by end user, hence stable and consistent end-use platform
Methodology Overview
• Download application to microcontroller program memory
• Reset microcontroller, causing (optimized) application execution in normal mode
• Activate self-optimizing mode, causing update of configuration memory
• Upload configuration memory for downloading to other microcontrollers
Self-optimizing mode
• Initializing
– Activated by extra pin or existing pin combo
– Traverse memory, detect loops, add addresses to loop count table
• Profiling
– Execute, update loop counts
• Requires fast increments
• We use fully-assoc. mem
• Hardware hash table possible
• Configuring
– Store most frequent loop addresses at bottom of program memory, set flag
[Figure: mode diagram with Download program, Normal mode, Self-optimizing mode, and Upload configuration]
[Figure: ROM, Self-Profiling Controller, and Loop Count Table with columns Loop addr. and Count; example entries: address 100 with count 05, address 200 with count 0900]
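The three phases of self-optimizing mode (initializing, profiling, configuring) can be sketched in software as follows. This is only a behavioral illustration under stated assumptions. The slides do not say how loops are detected, so the sketch assumes a loop is marked by a short backwards branch. The toy program encoding, `MAX_LOOP_LEN`, and all function names are hypothetical, and a `Counter` stands in for the fully-associative loop count table.

```python
from collections import Counter

# Hypothetical toy program: address -> (mnemonic, branch target or None).
# This encoding is made up for illustration; it is not the 8051 format.
PROGRAM = {
    100: ("add", None),
    101: ("djnz", 100),   # short backwards branch: loop starting at 100
    102: ("mov", None),
    200: ("add", None),
    201: ("add", None),
    202: ("djnz", 200),   # short backwards branch: loop starting at 200
    203: ("halt", None),
}
MAX_LOOP_LEN = 64  # assumed upper bound on loop size


def detect_loops(program, max_len=MAX_LOOP_LEN):
    """Initializing: traverse memory and collect loop start addresses."""
    starts = set()
    for addr, (_, target) in program.items():
        # Assumption: a branch to a nearby earlier address closes a loop.
        if target is not None and 0 <= addr - target < max_len:
            starts.add(target)
    return sorted(starts)


def profile(trace, loop_starts):
    """Profiling: count how often each candidate loop start is fetched."""
    counts = Counter({addr: 0 for addr in loop_starts})
    for pc in trace:
        if pc in counts:
            counts[pc] += 1
    return counts


def configure(counts, n_slots=1):
    """Configuring: keep the most frequently executed loop addresses."""
    return [addr for addr, _ in counts.most_common(n_slots)]


# Made-up execution trace: 5 passes through one loop, 900 through the other.
loop_starts = detect_loops(PROGRAM)
trace = [100, 101] * 5 + [102] + [200, 201, 202] * 900 + [203]
counts = profile(trace, loop_starts)
selected = configure(counts)
print(loop_starts, dict(counts), selected)
```

In hardware the selected addresses would be written to configuration memory and the flag set; here `selected` simply holds them.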
Normal mode
• Reset
– Read loop addresses (if any) into registers (LAR’s)
– Read corresponding loops into loop table
– Set flag in bypass controller
• Execute: Check if flag set and address match
– No: Fetch from ROM
– Yes: Begin fetching from loop table
– No tag comparisons, no misses
– Pre-computed extra bits quickly detect table exit
[Figure: mode diagram with Download program, Normal mode, Self-optimizing mode, and Upload configuration]
[Figure: RAM, ROM (200: ****), Loop Table (200: ****), Datapath, Controller, and Bypass Controller with LAR: 200]
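A minimal behavioral model of the normal-mode fetch decision is sketched below. It assumes a single loop address register and a known loop length, and it uses a simple address-range check as a stand-in for the pre-computed exit bits. The class name, fields, and ROM contents are all hypothetical.

```python
class BypassController:
    """Toy model of the fetch path; names and structure are illustrative."""

    def __init__(self, rom, loop_addr=None, loop_len=0):
        self.rom = rom
        # Reset: load the loop address register (LAR); set the flag only
        # if a loop address was configured.
        self.lar = loop_addr
        self.flag = loop_addr is not None
        # Reset: copy the corresponding loop from ROM into the loop table.
        self.loop_table = (
            [rom[loop_addr + i] for i in range(loop_len)] if self.flag else []
        )
        self.in_table = False
        self.rom_fetches = 0
        self.table_fetches = 0

    def fetch(self, pc):
        """Return the instruction at pc, from the loop table when possible."""
        if self.flag and not self.in_table and pc == self.lar:
            # Flag set and address match: begin fetching from the loop table.
            self.in_table = True
        elif self.in_table and not (
            self.lar <= pc < self.lar + len(self.loop_table)
        ):
            # Stand-in for the pre-computed bits that detect table exit.
            self.in_table = False
        if self.in_table:
            self.table_fetches += 1
            return self.loop_table[pc - self.lar]
        self.rom_fetches += 1
        return self.rom[pc]


# Made-up ROM contents and trace: 900 iterations of a 3-instruction loop.
ROM = {200: "add", 201: "add", 202: "djnz 200", 203: "halt"}
cpu = BypassController(ROM, loop_addr=200, loop_len=3)
trace = [200, 201, 202] * 900 + [203]
fetched = [cpu.fetch(pc) for pc in trace]
print(cpu.table_fetches, cpu.rom_fetches)
```

The fetched instruction stream is identical with or without the loop table; only the source of each fetch changes, which is consistent with the constraint of preserving cycle-by-cycle behavior.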
Results -- power
• Savings
– 34% total power savings after self-optimization
– Dependent on technology
• Power overhead
– Negligible when self-optimization idle
– Slight increase (5%) during self-optimization
• Setup
– Synopsys synthesis, simulation, and power analysis
– 8051 synthesizable VHDL model at UCR (www.cs.ucr.edu/~dalton)
[Figure: power in milliwatts (0 to 25) before (bef) and after (aft) self-optimization for Ex1: checksum, Ex2: gcd, and Ex3: matrix multiply; stacked components are ROM, RAM, ALU, Control, and Loop table and control]
Results – size (in cells)
• Big increase, but:
– 8051 version was small
• Others much bigger
• Smaller % overhead
– Transistors becoming cheaper

Subsystem                 Original   Extended
Controller                   3,391      3,767
ALU                          2,100      2,100
Decoder                        586        586
RAM (256 bytes)             17,312     17,312
ROM (8 kbytes)              11,000     11,000
Select logic                     -        132
Loop Count Table(32)             -     33,595
Loop Table(64)                   -     16,740
Self-Profiler/Bypass             -      7,188
Total:                      34,389     92,420

– Product-oriented IC’s: loop table and controller, no Self-Profiler or Loop Count Table
– Transfer configuration from prototype-oriented part to new product-oriented parts
– Supported by existing upload/download tools
– We are working on shrinking the Loop Count Table logic
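The size figures can be checked with a few lines of arithmetic. The sketch below recomputes both totals from the per-subsystem cell counts and estimates the size of a part without the Loop Count Table. Because the slides report the Self-Profiler and Bypass Controller as one combined figure, the profiler cannot be subtracted separately, so that estimate is only an upper bound for a product-oriented part. The variable names are illustrative.

```python
# Cell counts per subsystem, taken from the size table.
ORIGINAL = {
    "Controller": 3391,
    "ALU": 2100,
    "Decoder": 586,
    "RAM (256 bytes)": 17312,
    "ROM (8 kbytes)": 11000,
}
EXTENDED = {
    "Controller": 3767,
    "ALU": 2100,
    "Decoder": 586,
    "RAM (256 bytes)": 17312,
    "ROM (8 kbytes)": 11000,
    "Select logic": 132,
    "Loop Count Table(32)": 33595,
    "Loop Table(64)": 16740,
    "Self-Profiler/Bypass": 7188,
}

original_total = sum(ORIGINAL.values())
extended_total = sum(EXTENDED.values())
growth = extended_total / original_total

# Upper bound for a product-oriented part: drop only the Loop Count Table,
# since the Self-Profiler is not reported separately from the Bypass logic.
without_count_table = extended_total - EXTENDED["Loop Count Table(32)"]

print(original_total, extended_total, round(growth, 2), without_count_table)
```

The totals match the table, and the Loop Count Table alone accounts for more than half of the added cells, which fits the stated aim of shrinking that logic.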
Conclusions
• Mass-produced IC’s give big advantages
• Transistor abundance provides new opportunities
• We introduced:
– A self-optimization methodology and architecture
– A loop table as an example tunable component
• These items yielded:
– Power savings by reducing ROM access
• 34% savings for 8051 microcontroller for target technology
– No change in instruction set, tools, or performance
• Future work includes:
– Reducing size overhead while maintaining accuracy
– Trading off size with accuracy
– Extending loop table for multiple loops, subroutines, etc.
– Incorporating into 32-bit processor environment (LEON Sparc)
– Investigating other tunable components
• On-chip FPGA, configurable cache, etc.