Course Embedded Systems 2003


Compilers for embedded systems:
Why are compilers an issue?
• Many reports about the low efficiency of standard compilers.
• Special features of embedded processors have to be exploited.
• High levels of optimization are more important than compilation speed.
• Compilers can help to reduce energy consumption (energy optimization).
• Compilers could help to meet real-time constraints.
• Fewer legacy problems than for PCs.
• There is a large variety of instruction sets.
• Design space exploration for optimized processors makes sense.
Use of assembly languages in embedded systems
[Paulin, 1995]
[Chart: shares of C code vs. assembler code for DSPs and for µControllers]
A similar situation holds more recently.
Optimizations considered
• Energy-aware compilation
• Compilation for digital signal processors
• Compilation for multimedia processors
• Compilation for VLIW (very long instruction word) processors
• Compilation for network processors
• Compiler generation, retargetable compilers and design space exploration
Efforts for Reducing Energy
• Device Level
– Development of Low Power Devices
– Reducing Power Supply Voltage
– Reducing Threshold Voltage
• Circuit Level
– Gated Clock
– Pass-Transistor Logic
– Asynchronous Circuits
• System Level
• Fabrication level
Optimization for low energy the same as optimization for high performance?
No!
• High performance if the available memory bandwidth is fully used;
  low energy consumption if memories are in stand-by mode.
• Reduced energy if more values are kept in registers.
Example:
    int a[1000];
    c = a;
    for (i = 1; i < 100; i++) {
        b += *c;
        b += *(c+7);
        c += 1;
    }
Generated code that reloads values from memory:
    LDR r3, [r2, #0]
    ADD r3,r0,r3
    MOV r0,#28
    LDR r0, [r2, r0]
    ADD r0,r3,r0
    ADD r2,r2,#4
    ADD r1,r1,#1
    CMP r1,#100
    BLT LL3
2231 cycles, 19.92 µJ
Generated code that keeps more values in registers:
    ADD r3,r0,r2
    MOV r0,#28
    MOV r2,r12
    MOV r12,r11
    MOV r11,r10
    MOV r10,r9
    MOV r9,r8
    MOV r8,r1
    LDR r1, [r4, r0]
    ADD r0,r3,r1
    ADD r4,r4,#4
    ADD r5,r5,#1
    CMP r5,#100
    BLT LL3
2096 cycles, 16.47 µJ
Energy models
• Commercial tools are frequently very imprecise.
• Model of Tiwari (Dissertation, Princeton 1996):
  cost of instructions and of transitions between instructions;
  does not separate out the cost of memory accesses.
• Model of Simunic, de Micheli (DAC 99):
  model based on data sheets; does not require measurements;
  does not take transitions into account.
• Russell, Jacome (ICCD 1998): based on precise measurements
  for two fixed configurations;
  cannot predict the effect of changes to the memory architecture.
• Lee (LCTES 2001): detailed analysis of the effect of pipeline stages;
  does not include multi-cycle operations and stalls.
⇒ Dedicated energy models.
Energy Model by Knauer and Steinke
[Block diagram: ARM7 core (register file, ALU, multiplier, barrel shifter, instruction decoder & control logic) connected to the instruction memory via IAddr/Instr and to the data memory via DAddr/Data; monitored quantities include VDD, Opcode, Imm, Reg#, RegValue]
Etotal = Ecpu_instr + Ecpu_data + Emem_instr + Emem_data
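For orientation, a minimal C sketch of this four-term decomposition, using made-up component values; the struct and function names are illustrative and not part of the original model:

    #include <stdio.h>

    /* The four energy components of the Knauer/Steinke model
       (illustrative values in nJ). */
    struct energy_breakdown {
        double cpu_instr;   /* Ecpu_instr: instruction-dependent CPU cost */
        double cpu_data;    /* Ecpu_data:  CPU cost on the data buses     */
        double mem_instr;   /* Emem_instr: instruction memory accesses    */
        double mem_data;    /* Emem_data:  data memory accesses           */
    };

    static double total_energy(const struct energy_breakdown *e)
    {
        return e->cpu_instr + e->cpu_data + e->mem_instr + e->mem_data;
    }

    int main(void)
    {
        struct energy_breakdown e = { 10.5, 2.3, 6.1, 3.8 };  /* made-up numbers */
        printf("Etotal = %.1f nJ\n", total_energy(&e));
        return 0;
    }

The two following slides define how the CPU and memory components themselves are computed.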
Instruction dependent costs in the CPU
Cost for a sequence of m instructions:
Ecpu_instr = Σi=1..m [ MinCostCPU(Opcodei)
  + α1 · Σj w(Immi,j)      + β1 · Σj h(Immi-1,j, Immi,j)
  + α2 · Σk w(Regi,k)      + β2 · Σk h(Regi-1,k, Regi,k)
  + α3 · Σk w(RegVali,k)   + β3 · Σk h(RegVali-1,k, RegVali,k)
  + α4 · w(IAddri)         + β4 · h(IAddri-1, IAddri)
  + FUCost(Instri-1, Instri) ]
w:      number of ones
h:      Hamming distance
FUCost: cost of switching functional units
α, β:   determined through experiments
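As a rough illustration of how this instruction-dependent term could be evaluated, the following C sketch accumulates the weight and Hamming-distance contributions over a trace of instructions. The coefficient values, the per-opcode base cost and the FU-switching cost are placeholders, and each instruction is reduced to a few 32-bit fields; none of these names are part of the original model.

    #include <stdint.h>
    #include <stddef.h>

    /* w: number of ones in a word */
    static unsigned w(uint32_t x)
    {
        unsigned n = 0;
        while (x) { n += x & 1u; x >>= 1; }
        return n;
    }

    /* h: Hamming distance between consecutive values on a bus */
    static unsigned h(uint32_t prev, uint32_t cur)
    {
        return w(prev ^ cur);
    }

    /* Reduced view of one executed instruction (illustrative). */
    struct instr {
        uint32_t opcode;    /* index into the per-opcode base-cost table */
        uint32_t imm;       /* immediate field                           */
        uint32_t reg_bits;  /* encoded register numbers                  */
        uint32_t reg_val;   /* value read from the register file         */
        uint32_t iaddr;     /* instruction address                       */
    };

    /* Placeholder coefficients; in the model they are fitted from measurements. */
    static const double alpha[5] = { 0.0, 0.02, 0.03, 0.01, 0.015 };
    static const double beta[5]  = { 0.0, 0.04, 0.05, 0.02, 0.025 };

    static double min_cost_cpu(uint32_t opcode) { (void)opcode; return 1.0; }
    static double fu_cost(const struct instr *a, const struct instr *b)
    { (void)a; (void)b; return 0.0; }

    /* Accumulate Ecpu_instr over a trace of m instructions. */
    double e_cpu_instr(const struct instr *tr, size_t m)
    {
        double e = 0.0;
        for (size_t i = 0; i < m; i++) {
            const struct instr *cur  = &tr[i];
            const struct instr *prev = (i > 0) ? &tr[i - 1] : cur;
            e += min_cost_cpu(cur->opcode)
               + alpha[1] * w(cur->imm)      + beta[1] * h(prev->imm, cur->imm)
               + alpha[2] * w(cur->reg_bits) + beta[2] * h(prev->reg_bits, cur->reg_bits)
               + alpha[3] * w(cur->reg_val)  + beta[3] * h(prev->reg_val, cur->reg_val)
               + alpha[4] * w(cur->iaddr)    + beta[4] * h(prev->iaddr, cur->iaddr)
               + fu_cost(prev, cur);
        }
        return e;
    }

The Ecpu_data, Emem_instr and Emem_data terms on the next slide follow the same weight/Hamming-distance pattern and could be accumulated in the same loop.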
Other costs
Ecpu_data = Σi [ α5 · w(DAddri) + β5 · h(DAddri-1, DAddri)
  + α6 · w(Datai) + β6 · h(Datai-1, Datai) ]
Emem_instr = Σi [ MinCostMem(InstrMem, Word_widthi)
  + α7 · w(IAddri) + β7 · h(IAddri-1, IAddri)
  + α8 · w(IDatai) + β8 · h(IDatai-1, IDatai) ]
Emem_data = Σi [ MinCostMem(DataMem, Direction, Word_widthi)
  + α9 · w(DAddri)  + β9 · h(DAddri-1, DAddri)
  + α10 · w(Datai)  + β10 · h(Datai-1, Datai) ]
Results
• It is not important which address bit is set to '1'.
• The number of '1's on the address bus is irrelevant.
• The cost of flipping a bit on the address bus is independent of the bit position.
• It is not important which data bit is set to '1'.
• The number of '1's on the data bus has only a minor effect (3%).
• The cost of flipping a bit on the data bus is independent of the bit position.
Compiler optimizations for improving energy efficiency
• Energy-aware scheduling
• Energy-aware instruction selection
• Operator strength reduction: e.g. replace * by + and <<
• Minimize the bitwidth of loads and stores
• Standard compiler optimizations with energy as a cost function,
  e.g. register pipelining (a C rendering follows after this list):
      for i := 0 to 10 do
        C := 2 * a[i] + a[i-1];
  is transformed into
      R2 := a[0];
      for i := 1 to 10 do
      begin
        R1 := a[i];
        C := 2 * R1 + R2;
        R2 := R1;
      end;
• Exploitation of the memory hierarchy
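A C rendering of the register-pipelining example above, as a minimal sketch. The array size and the r1/r2 temporaries mirror the slide; the driver code is illustrative, and the slide's original loop (which starts at i = 0) is adjusted to start at i = 1 so that a[i-1] stays within bounds:

    #include <stdio.h>

    #define N 11               /* a[0] .. a[10], as in the slide's example */

    int a[N];
    int c;

    /* Before: each iteration loads both a[i] and a[i-1] from memory. */
    static void without_register_pipelining(void)
    {
        for (int i = 1; i < N; i++)
            c = 2 * a[i] + a[i - 1];
    }

    /* After: a[i-1] is carried over in a register-like temporary, so each
       iteration performs only one memory load. */
    static void with_register_pipelining(void)
    {
        int r2 = a[0];                 /* holds a[i-1] across iterations */
        for (int i = 1; i < N; i++) {
            int r1 = a[i];             /* single load per iteration */
            c = 2 * r1 + r2;
            r2 = r1;                   /* pipeline the value to the next iteration */
        }
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) a[i] = i;
        without_register_pipelining();
        printf("c = %d\n", c);
        with_register_pipelining();
        printf("c = %d\n", c);
        return 0;
    }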
3 key problems for future memory systems
1. (Average) Speed
2. Energy/Power
3. Predictability/Worst-Case Execution Time
[Figure: memory hierarchy pyramid annotated with energy and access-time axes; levels closer to the processor are smaller, faster and use less energy per access]
1. (Average) Speed
The speed gap between the processor and main DRAM increases:
• early 1960s (Atlas): page fault ~ 2500 instructions
• 2002 (2 GHz µP): access to DRAM ~ 500 instructions
⇒ the penalty for a cache miss will soon be the same as that of a page fault on the Atlas
[Plot: processor vs. memory speed over the years; the gap grows by roughly 2x every 2 years]
[P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]
2. Power/Energy
Example (CACTI model):
[Plot: energy per access [nJ] (roughly 0 to 2.5 nJ) as a function of memory size [bytes], from 64 B to 8 KB; smaller memories need less energy per access]
[Steinke et al., Inf 12, UniDo, 2002]
3. Predictability/WCET
• Predictability: for satisfying timing constraints in hard real-time systems,
  predictability is the most important concern; pre-run-time scheduling is often
  the only practical means of providing predictability in a complex system [Xu, Parnas]
  ⇒ time-triggered, statically scheduled operating systems
• What about memory accesses?
  – Currently available caches don't solve the problem:
    • they improve the average-case behavior
    • they use "non-deterministic" cache replacement algorithms
⇒ Scratch-pad/tightly coupled memory based predictability
Hierarchical memories using scratch pad memories (SPM)
[Figure: hierarchy of processor, scratch pad memory and main memory; the address space (0 to FFF..) is partitioned between the SPM and main memory, so no tag memory is needed]
Example: ARM7TDMI cores, well-known for low power consumption.
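Because the SPM is just a fixed range of the address space, hot data can be placed there by assigning it to a dedicated linker section. A minimal sketch, assuming a GCC-style toolchain and a linker script that maps an output section named ".spm" onto the scratch pad address range (the section name and the array are illustrative):

    #include <stdint.h>

    /* Frequently accessed coefficients placed into the scratch pad,
       assuming the linker script maps ".spm" onto the SPM address range. */
    static int32_t coeff[64] __attribute__((section(".spm")));

    /* Every access to coeff now goes to the SPM instead of main memory. */
    int32_t dot(const int32_t *x, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += x[i] * coeff[i];
        return acc;
    }

Since the SPM has no tag memory, every such access hits by construction; the allocation decision is entirely up to the compiler or programmer.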
Exploitation of SPM
[Figure: program segments (loops, arrays, functions) in the board's main memory, with the question which of them to move into the SPM of capacity K next to the processor]
Which segments (arrays, loops, etc.) should be stored in the SPM?
Gain gi and size si for each segment i.
Maximise the gain G = Σ gi, respecting the constraint K ≥ Σ si.
• Static memory allocation: solved with the knapsack algorithm (see the sketch below).
• Dynamic reloading: finding optimal reloading points.
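A minimal sketch of the static allocation step as a 0/1 knapsack problem, assuming the gains gi (e.g. saved energy) and sizes si of the candidate segments have already been determined by earlier analysis; the numbers below are made up:

    #include <stdio.h>
    #include <string.h>

    #define K 256          /* SPM capacity in bytes (example value) */
    #define NSEG 4         /* number of candidate segments          */

    /* Made-up gains (saved energy, arbitrary units) and sizes (bytes). */
    static const int gain[NSEG] = { 120, 90, 60, 40 };
    static const int size[NSEG] = { 128, 96, 64, 48 };

    /* Classic 0/1 knapsack DP: best[c] = maximal gain using capacity c. */
    int main(void)
    {
        int best[K + 1];
        memset(best, 0, sizeof best);

        for (int i = 0; i < NSEG; i++)
            for (int c = K; c >= size[i]; c--)   /* reverse order: each segment used once */
                if (best[c - size[i]] + gain[i] > best[c])
                    best[c] = best[c - size[i]] + gain[i];

        printf("maximal gain for a %d-byte SPM: %d\n", K, best[K]);
        return 0;
    }

Recovering which segments were actually selected would additionally require keeping back-pointers, omitted here for brevity.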
Reduction in energy and average run-time
[Chart: cycle counts and energy for the Multi_sort benchmark (a mix of sort algorithms)]
Energy consumption per functional unit, as a function of the SPM size
[Chart: parameters differ from those on the previous slide]
Hardware support for block-copying
[Diagram: DMA unit between CPU and memory]
• The DMA unit was modeled in VHDL, simulated and synthesized; it makes up only 4% of the processor chip.
• The unit can be put to sleep when it is unused.
• Code size reductions of up to 23% for a 256-byte SPM were determined when the DMA unit is used instead of the dynamic approach that uses processor instructions for copying.
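For contrast, a sketch of the two copying styles: a plain instruction-driven copy loop versus a copy issued to a memory-mapped DMA unit. The DMA register layout and addresses below are purely hypothetical and only illustrate the idea of off-loading the copy.

    #include <stdint.h>
    #include <stddef.h>

    /* Instruction-driven copy: the CPU moves every word itself. */
    void copy_with_cpu(uint32_t *dst, const uint32_t *src, size_t words)
    {
        for (size_t i = 0; i < words; i++)
            dst[i] = src[i];
    }

    /* Hypothetical memory-mapped DMA unit (register layout invented for
       illustration; a real unit is defined by the hardware design). */
    #define DMA_SRC   (*(volatile uint32_t *)0xFFFF0000u)
    #define DMA_DST   (*(volatile uint32_t *)0xFFFF0004u)
    #define DMA_LEN   (*(volatile uint32_t *)0xFFFF0008u)
    #define DMA_CTRL  (*(volatile uint32_t *)0xFFFF000Cu)
    #define DMA_START 0x1u
    #define DMA_BUSY  0x2u

    /* DMA-driven copy: the CPU only programs the transfer and waits. */
    void copy_with_dma(uint32_t dst, uint32_t src, uint32_t bytes)
    {
        DMA_SRC  = src;
        DMA_DST  = dst;
        DMA_LEN  = bytes;
        DMA_CTRL = DMA_START;
        while (DMA_CTRL & DMA_BUSY)
            ;                  /* or put the CPU to sleep until an interrupt */
    }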