Hardware platforms for Embedded
computing
The energy/flexibility conflict
[Figure: intrinsic power efficiency (operations/Watt, in MOPS/mW, axis 0.01 to 10) versus technology (1.0µ, 0.5µ, 0.25µ, 0.13µ, 0.07µ), with regions for Ambient Intelligence, DSP-ASIPs, and µPs (poor design techniques).]
Necessary to optimize HW/SW; otherwise the price for software flexibility cannot be paid!
[H. de Man, Keynote, DATE'02; T. Claasen, ISSCC99]
Architectural Choices
[Figure: architectural choices ordered by flexibility against 1/efficiency (power, speed): general purpose µP, software programmable DSP, hardware reconfigurable processor, and direct mapped hardware. Block labels: Prog Mem, µP, Satellite Processor, Dedicated Logic, MAC Unit, Addr Gen.]
The Processor Design Space
[Figure: performance versus cost.]
• Application specific architectures for performance
• Embedded processors
• Microprocessors: performance is everything & software rules
• Microcontrollers: cost is everything
Area of processor cores = Cost
[Figure: Nintendo processor, cellular phones.]
Another figure of merit: Computation per unit area
[Figure: Nintendo processor, cellular phones, ???]
Embedded vs. general-purpose processors
• Embedded processors may be optimized for a category of applications.
  - Customization may be narrow or broad.
• We may judge embedded processors using different metrics:
  - Code size.
  - Memory system performance.
  - Predictability.
Microcontrollers
[Figure: a single chip containing the CPU, memory (ROM, RAM), and I/O.]
Subsystems: Timers, Counters, Analog Interfaces, I/O interfaces
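Microcontroller Architectures
[Figure: Von Neumann Architecture: the CPU reaches one memory holding program + data (addresses 0 to 2^n) over a single address bus and data bus.]
[Figure: Harvard Architecture: the CPU reaches program memory over its own address bus and fetch bus, and data memory over a separate address bus and data bus.]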
MCS-51 “Family” of Microcontrollers
• 8051 introduced by Intel in late 1970s
• Now produced by many companies in many variations
• The most popular microcontroller, with about 40% of market share
• 8-bit microcontroller
“Original” 8051 Microcontroller
[Figure: block diagram. An internal data bus connects the 8051 CPU to 4096 bytes of program memory, 128 bytes of data memory, the oscillator and timing block, two 16-bit timer/event counters, the 64 KByte bus expansion control, programmable I/O (parallel ports, address/data bus, I/O pins), and a programmable serial port (full duplex UART, synchronous shifter, serial input and serial output). Interrupt control takes subsystem interrupts and external interrupts.]
Microcontrollers - MHS 80C51 as an example
Features for Embedded Systems
• 8-bit CPU optimised for control applications
• Extensive Boolean processing capabilities
• 64 k Program Memory address space
• 64 k Data Memory address space
• 4 k bytes of on chip Program Memory
• 128 bytes of on chip data RAM
• 32 bi-directional and individually addressable I/O lines
• Two 16-bit timers/counters
• Full duplex UART
• 6 sources/5-vector interrupt structure with 2 priority levels
• On chip clock oscillators
• Very popular CPU with many different variations
RISC processors
• RISC generally means highly pipelinable, one instruction per cycle.
• Pipelines of embedded RISC processors have grown over time:
  - ARM7 has 3-stage pipeline.
  - ARM9 has 5-stage pipeline.
  - ARM11 has eight-stage pipeline.
[Figure: ARM11 pipeline [ARM05].]
RISC processor families
• ARM: ARM7 is relatively simple, no memory management; ARM11 has memory management, other features.
• MIPS: MIPS32 4K has 5-stage pipeline; 4KE family has DSP extension; 4KS is designed for security.
• PowerPC: 400 series includes several embedded processors; MPC7410 is two-issue machine; 970FX has 16-stage pipeline.
DSP Applications
• Audio applications
  - MPEG Audio
  - Portable audio
• Digital cameras
• Wireless
  - Cellular telephones
  - Base station
• Networking
  - Cable modems
  - ADSL
  - VDSL
Another Look at DSP Applications
• High-end
  - Wireless Base Station - TMS320C6000
  - Cable modem gateways
• Mid-end
  - Cellular phone - TMS320C540
  - Fax/voice server
• Low end
  - Storage products - TMS320C27
  - Digital camera - TMS320C5000
  - Portable phones
  - Wireless headsets
  - Consumer audio
  - Automobiles, toasters, thermostats, ...
[Figure axes: increasing cost toward the high end, increasing volume toward the low end.]
DSP vs. General Purpose MPU
• The “MIPS/MFLOPS” of DSPs is speed of Multiply-Accumulate (MAC).
  - DSPs are judged by whether they can keep the multipliers busy 100% of the time.
• The "SPEC" of DSPs is 4 algorithms:
  - Infinite Impulse Response (IIR) filters
  - Finite Impulse Response (FIR) filters
  - FFT, and
  - convolvers
• In DSPs, algorithms are king!
  - Binary compatibility not an issue
• Software is not (yet) king in DSPs.
  - People still write in assembly language for a product to minimize the die area for ROM in the DSP chip.
Architectural Features of DSPs
• Data path configured for DSP
  - Fixed-point arithmetic
  - MAC - Multiply-accumulate
• Multiple memory banks and buses
  - Harvard Architecture
  - Multiple data memories
• Specialized addressing modes
  - Bit-reversed addressing
  - Circular buffers
• Specialized instruction set and execution control
  - Zero-overhead loops
  - Support for MAC
• Specialized peripherals for DSP
• THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN!!!
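Bit-reversed addressing is only named in the list above. As a rough illustration, assuming the usual use of this mode for reordering FFT inputs or outputs, the sketch below computes the address sequence in software; the function name `bit_reverse` and the 3-bit width are illustrative choices, not taken from the slides.

```python
def bit_reverse(index, bits):
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        # Shift the lowest remaining bit of index into result.
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


# Access order for an 8-point buffer (3 address bits).
order = [bit_reverse(i, 3) for i in range(8)]
print(order)
```

A DSP with this addressing mode produces the same sequence in its address generation unit, so no extra instructions are spent on the reordering.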
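Domain-oriented architectures
Application: $y[j] = \sum_{i=0}^{n-1} x[j-i] \cdot a[i]$
$\forall i,\ 0 \le i \le n-1:\ y_i[j] = y_{i-1}[j] + x[j-i] \cdot a[i]$
Architecture: Example: Data path ADSP210x
[Figure: ADSP210x data path. Program memory (P) and data memory (D); address registers A0, A1, A2, .. with an address generation unit (AGU) stepping i+1 and j-i+1; registers AX, AY, AF, AR around the ALU (+,-,..); registers MX (holding x[j-i]), MY (holding a[i]), MF, MR around the multiplier (*), which forms x[j-i]*a[i] and adds it to y_{i-1}[j] in MR.]
- Parallelism
- Dedicated registers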
MR:=0; A1:=1; A2:=n-2;
MX:=x[n-1]; MY:=a[0];
for ( j:=1 to n)
{MR:=MR+MX*MY;
MY:=a[A1]; MX:=x[A2];
A1++; A2--}
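The register-level loop above can be mirrored in a few lines of Python. This is a minimal sketch, assuming plain Python lists for x and a; the function name `fir_output` is hypothetical, while the local names `mr`, `mx`, and `my` simply echo the MR, MX, and MY registers of the slide.

```python
def fir_output(x, a, j):
    """Compute y[j] = sum over i of x[j-i] * a[i], one MAC per filter tap."""
    n = len(a)
    if j < n - 1 or j >= len(x):
        raise ValueError("need n-1 <= j < len(x) so every x[j-i] exists")
    mr = 0                 # accumulator, role of MR
    for i in range(n):     # a zero-overhead loop on a real DSP
        mx = x[j - i]      # sample operand, role of MX
        my = a[i]          # coefficient operand, role of MY
        mr += mx * my      # one multiply-accumulate step
    return mr


samples = [1, 2, 3, 4]
coeffs = [1, 0, -1]
print(fir_output(samples, coeffs, 3))
```

On the DSP the two operand fetches and the address updates run in parallel with the MAC, which is why the slide loads MX and MY one iteration ahead.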
DSP - Features (1)
• Multiply/accumulate (MAC) and zero-overhead loop
(ZOL) instructions (as shown)
• Heterogeneous registers (as shown)
• Separate address generation units (AGUs)
(as in ADSP 210x)
DSP - Features (2)
• Modulo addressing: Am++ ≡ Am:=(Am+1) mod n (implements ring or circular buffer in memory)
[Figure: sliding window over signal x between times t1 and t2. Memory at t=t1 holds .., x[n-2], x[n-1], x[0], x[1], ..; memory at t2=t1+1 holds .., x[n-3], x[n-2], x[n-1], x[n], x[1]: the new sample x[n] takes the place of x[0].]
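The modulo update Am:=(Am+1) mod n can be tried out in software. The sketch below is an illustration only: the class name `CircularBuffer` and its methods are hypothetical, and the attribute `am` stands in for the address register Am.

```python
class CircularBuffer:
    """Fixed-size sample window kept in place with a modulo-updated index."""

    def __init__(self, n):
        self.data = [0] * n
        self.am = 0  # next write position, role of address register Am

    def push(self, sample):
        # The newest sample overwrites the oldest one.
        self.data[self.am] = sample
        self.am = (self.am + 1) % len(self.data)  # Am := (Am+1) mod n

    def window(self):
        """Return the stored samples ordered from oldest to newest."""
        n = len(self.data)
        return [self.data[(self.am + k) % n] for k in range(n)]


buf = CircularBuffer(3)
for sample in (10, 20, 30, 40):
    buf.push(sample)
print(buf.data, buf.window())
```

No sample is ever moved in memory; only the index wraps around, which is what the hardware addressing mode provides for free.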
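Multiple memory banks or memories
[Figure: ADSP210x data path with separate memories P and D, address registers A0, A1, A2, .., the address generation unit (AGU), registers AX, AY, AF, AR around the ALU (+,-,..), and registers MX, MY, MF, MR around the multiplier (*).]
Simplifies parallel fetches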
Very long instruction word
(VLIW) processors
Key idea: detection of possible parallelism to be done by
compiler, not by hardware at run-time (inefficient).
VLIW: parallel operations (instructions) encoded in one long
word (instruction packet), each instruction controlling one
functional unit. E.g.:
The Texas Instruments
TMS 320C6xx as an example
Bit in each instruction encodes end of parallel execution
[Figure: seven 32-bit instructions A to G (bits 31 to 0); the parallel-execution bit of A, B, C, D, E, F, G is 0, 1, 1, 0, 1, 1, 0.]
Cycle   Instruction
1       A
2       B  C  D
3       E  F  G
Instructions B, C and D use
disjoint functional units,
cross paths and other data
path resources. The same
is also true for E, F and G.
Parallel execution cannot span several packets.
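The bit pattern in the figure can be decoded with a short loop. This sketch assumes the reading that a bit value of 1 chains an instruction to the one that follows it and a value of 0 ends the group; the function name `group_into_cycles` is hypothetical, and packet boundaries are not modeled.

```python
def group_into_cycles(instructions):
    """Group (name, p_bit) pairs into lists of instructions issued together.

    p_bit == 1: the next instruction runs in parallel with this one.
    p_bit == 0: this instruction ends the current parallel group.
    """
    cycles, current = [], []
    for name, p_bit in instructions:
        current.append(name)
        if p_bit == 0:
            cycles.append(current)
            current = []
    if current:  # trailing group whose last bit was 1
        cycles.append(current)
    return cycles


stream = [("A", 0), ("B", 1), ("C", 1), ("D", 0), ("E", 1), ("F", 1), ("G", 0)]
schedule = group_into_cycles(stream)
print(schedule)
```

The grouping is fixed by the compiler when it sets the bits, so the hardware needs no dependence checking at run time.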
Partitioned register files
• Many memory ports are required to supply enough operands per cycle.
• Memories with many ports are expensive.
⇒ Registers are partitioned into (typically 2) sets, e.g. for TI C60x:
[Figure: data path A with register file A and functional units L1, S1, M1, D1; data path B with register file B and functional units D2, M2, S2, L2; address bus and data bus.]
Instruction types are mapped to functional unit types
• There are 4 functional unit (FU) types:
  - M: Memory Unit
  - I: Integer Unit
  - F: Floating-Point Unit
  - B: Branch Unit
• Instruction types → corresponding FU type, except type A (mapping to either I or M functional units).
Large # of delay slots, a problem of VLIW processors
[Figure: pipeline filled with the instructions add, sub, and, or, sub, mult, xor, div, ld, st, mv, beq.]
The execution of many instructions has been started before it is realized that a branch was required.
Nullifying those instructions would waste compute power.
⇒ Executing those instructions is declared a feature, not a bug.
⇒ How to fill all „delay slots“ with useful instructions?
⇒ Avoid branches wherever possible.
Predicated execution:
Implementing IF-statements
„branch-free“
Conditional Instruction „[c] I“ consists of:
• condition c
• instruction I
c = true => I executed
c = false => NOP
Predicated execution:
Implementing IF-statements
„branch-free“: TI C6x
if (c)
{ a = x + y;
b = x + z;
}
else
{ a = x - y;
b = x - z;
}
Conditional branch (max. 12 cycles):
     [c] B L1
         NOP 5
         B L2
         NOP 4
         SUB x,y,a
      || SUB x,z,b
L1:      ADD x,y,a
      || ADD x,z,b
L2:

Predicated execution (1 cycle):
     [c]  ADD x,y,a
  || [c]  ADD x,z,b
  || [!c] SUB x,y,a
  || [!c] SUB x,z,b
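The semantics of „[c] I“ and the 12-cycle figure can be checked with a small model. This is a sketch under two assumptions that are not stated on the slide: every issued instruction group costs one cycle, and NOP k costs k cycles. The names `predicated_if` and `BRANCH_WORST_CASE` are hypothetical.

```python
def predicated_if(c, x, y, z):
    """Model the four predicated instructions issued in one cycle."""
    packet = [
        (c, "a", x + y),      # [c]  ADD x,y,a
        (c, "b", x + z),      # [c]  ADD x,z,b
        (not c, "a", x - y),  # [!c] SUB x,y,a
        (not c, "b", x - z),  # [!c] SUB x,z,b
    ]
    regs = {}
    for predicate, dest, value in packet:
        if predicate:         # a false predicate turns the instruction into a NOP
            regs[dest] = value
    return regs["a"], regs["b"]


# Longest path of the branching version (c false):
# [c] B L1, NOP 5, B L2, NOP 4, then SUB || SUB.
BRANCH_WORST_CASE = 1 + 5 + 1 + 4 + 1

print(predicated_if(True, 5, 2, 3), predicated_if(False, 5, 2, 3), BRANCH_WORST_CASE)
```

All four operations occupy functional units in the predicated version, so the saving in cycles is paid for with issue slots that do no useful work.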
Architecture Evolution
[Figure: tiled SoC with processing elements (PE), local SRAM, a CPU, I/O, peripherals, a local memory hierarchy, DRAM, and 3D stacked main memory.]
• Roadmap continues: 90→65→45 nm
• “Traditional” bus-based SoCs fit in one tile!!
• Communication demand is staggering, but unevenly distributed, because of architectural heterogeneity
Multicores Are Here!
[Figure: number of cores per chip (1 to 512, doubling scale) versus year (1970 to 20??) [Amarasinghe06]. Chips shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Power4, PA-8800, Opteron, Tanglewood, PExtreme, Power6, Yonah, Xeon MP, Opteron 4P, Xbox360, Broadcom 1480, Niagara, Cell, Raw, Raza XLR, Cavium Octeon, Cisco CSR-1, Intel Tflops, Ambric AM2045, Picochip PC102.]
MPSoC: 2005 ITRS roadmap
[Figure [Martin06]: x axis 2005 to 2020. Left axis (0 to 60): total logic size and total memory size, both normalized to 2005. Right axis (0 to 1200): number of processing engines. Data labels for the number of processing engines, in increasing order over 2005 to 2020: 16, 23, 32, 46, 63, 79, 101, 133, 161, 212, 268, 348, 424, 526, 669, 878.]
Power is the Challenge!
[Figure: power (W) and power density (W/cm2), axis 0 to 1400, for a 10 mm die at 90nm, 65nm, 45nm, 32nm, 22nm, 16nm, split into active power, SD leakage, and SiO2 leakage.]
Technology, Circuits, and Architecture to constrain the power
Near Term Solutions
• Move away from Frequency alone to deliver performance
• More on-die memory
• Multi-everywhere
  - Multi-threading
  - Chip level multi-processing
• Throughput oriented designs
• Performance by higher level of integration
Architecture Techniques
Multi-threading
[Figure: a single thread (ST) idles while it waits for memory; with multi-threading, threads MT1, MT2, MT3 fill each other's memory waits, giving full HW utilization.]
Increase on-die Memory
[Figure: cache % of total area (0% to 100%) across 1u, 0.5u, 0.25u, 0.13u, 65nm for 486, Pentium®, Pentium® III, Pentium® 4, Pentium® M.]
Improved performance, no impact on thermals & power delivery
Chip Multi-processing
[Figure: die with cores C1 to C4 and a cache. Plot of relative performance (1 to 3.5) versus die area and power (1 to 4) for multi core and single core.]
Multi-Core
[Figure: a large core with cache (power 4, performance 2) next to four small cores C1 to C4 with cache (power 1 and performance 1 each). Relative to the large core, a small core has Power = 1/4 and Performance = 1/2.]
Multi-Core:
• Power efficient
• Better power and thermal management
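The figures Power = 1/4 and Performance = 1/2 for a small core can be turned into a quick budget calculation. The sketch below assumes a perfectly parallel workload, so that throughput adds up across cores; that assumption, and the name `multicore_budget`, are not from the slides.

```python
from fractions import Fraction

# A small core relative to the large core, as given on the slide.
SMALL_CORE_POWER = Fraction(1, 4)
SMALL_CORE_PERFORMANCE = Fraction(1, 2)


def multicore_budget(num_small_cores):
    """Return (total power, aggregate throughput) relative to one large core."""
    total_power = num_small_cores * SMALL_CORE_POWER
    # Idealized: throughput scales linearly with the number of cores.
    total_throughput = num_small_cores * SMALL_CORE_PERFORMANCE
    return total_power, total_throughput


power, throughput = multicore_budget(4)
print(power, throughput)
```

Under that assumption four small cores match the power of the large core while doubling its throughput; real workloads with serial sections will gain less.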
Embedded vs. General Purpose
Embedded Applications
• Asymmetric Multi-Processing
  - Differentiated Processors
• Specific tasks known early
  - Mapped to dedicated processors
  - Configurable and extensible processors: performance, power efficiency
• Communication
  - Coherent memory
  - Shared local memories
  - HW FIFOS, other direct connections
• Dataflow programming models
• Classical example: smart mobile (RISC + DSP + Media processors)
Server Applications
• Symmetric Multi-Processing
  - Homogeneous cores
• General tasks known late
  - Tasks run on any core
  - High-performance, high-speed microprocessors
• Communication
  - large coherent memory space on multi-core die or bus
• SMT programming models (Simultaneous Multi-Threading)
• Examples: large server chips (e.g. Sun Niagara 8x4 threads), scientific multi-processors
MPSoC architectures
Example system platforms
• Generic
• Automotive
• Wireless
• Multimedia
PC-based platform
• Basic hardware components:
  - CPU;
  - memory;
  - timers;
  - DMA;
  - minimal I/O devices.
• Basic software:
  - BIOS.
PC-style hardware architecture
[Figure: CPU, memory, and DMA controller on the system bus; a bridge to the high-speed bus with I/O; a bus interface to the low-speed bus with timers and I/O.]
StrongARM
• StrongARM system includes:
  - CPU chip (3.686 MHz clock)
  - system control module (32.768 kHz clock).
• System control module:
  - Real-time clock;
  - operating system timer;
  - general-purpose I/O;
  - interrupt controller;
  - power manager controller;
  - reset controller.
Pros and cons
• Plentiful hardware options.
• Simple programming semantics.
• Good software development environments.
• Performance-limited.
TI Open Wireless Multimedia Applications Platform
• Dual-processor shared memory system:
[Figure: a general-purpose processor (running the GPP OS and the DSP manager) and a DSP (running the DSP OS and DSP tasks) joined by a bridge, with memory and I/O control in front of external memory.]
http://www.ti.com/sc/docs/apps/wireless/omap/overview.htm
TI OMAP™ Hardware platform
• ARM9 core
  - 16KB I-cache
  - 8KB D-cache
  - 2-way set associative
  - 150 MHz
• C55x DSP core
  - 16KB I-cache
  - 8KB RAM set
  - 2-way set associative
  - 200 MHz
[Figure: program memory and SDRAM behind the memory & traffic controller; RISC core with I-MMU, I-cache, D-MMU, D-cache; DSP core with MMU, I-cache, internal RAM/ROM, DMA, plus application coprocessors; peripherals: LCD controller, interrupt handlers, timers, GPIO, UARTs, ...]
OMAPI Standard (ST/TI)
• Goal: standardize the interfaces between application processor and peripheral devices in a mobile product
• Provide standard services (APIs) in the OS that can be used by application developers
STMicro Nomadik platform
Main Core
Memory System
HW Accelerators
I/Os
Nomadik SW platform

Compliant with OMAPI standard
Philips Digital Video Nexperia Platform
[Figure: Nexperia™ DVP system silicon. A MIPS™ CPU (PRxxxx, with I$ and D$) and a TriMedia™ CPU (TM-xxxx, with I$ and D$) each reach device IP blocks over a PI bus; both reach SDRAM through the MMI over the DVP memory bus.]
General-purpose Scalable RISC Processor (MIPS™)
• 50 to 300+ MHz
• 32-bit or 64-bit
Scalable VLIW Media Processor (TriMedia™)
• 100 to 300+ MHz
• 32-bit or 64-bit
Library of Device IP Blocks
• Image coprocessors
• DSPs
• UART
• 1394
• USB
…and more
Nexperia™ System Buses
• 32-128 bit
Nexperia™-DVP Software Architecture
[Figure: software stack, top to bottom: Applications; Middleware (JavaTV, TVPAK, OpenTV, MHP/Java, proprietary ...); Streaming and Platform Software; Kernel (pSOS, Win-CE, JavaOS); Nexperia Hardware.]
Nexperia™-DVP Streaming Software
• Encapsulates implementation of streaming media components (hardware and software)
Nexperia™ Platform Software
• Supports multiple OSs and middleware software
• Abstracts platform functionality via consistent APIs
• OS independent device drivers for on-chip and off-chip devices
Infineon Automotive Platform
TC1166
Applications
• High performance drives / servo drives
• Industrial control
• Robotics
Features
• 32-bit super-scalar TriCore™ V1.3 CPU, 4 stage pipeline
  - Fully integrated DSP capabilities
  - Single precision floating point unit (FPU)
  - 80 MHz at full industrial temperature range
• 32-bit peripheral control processor with single cycle instruction (PCP2)
• Memories
  - 1.5 MByte embedded program flash with ECC
  - 32 KByte data flash - EEPROM emulation
  - 56 KB SRAM, 8 KB I$, 16 KB Imem
• 8-channel DMA controller
• Interrupt system with 2 x 255 hardware priority arbitration levels serviced by CPU and PCP2 coprocessor
• Triple bus structure: 64-bit local memory buses to internal flash and data memory, 32-bit system peripheral bus, 32-bit remote peripheral bus
MOSAIC SW Architecture & Components for Automotive Dashboard and Body Control
[Figure: layered software stack.
  Application Platform layer (@ 10% of total SW): application libraries, customer libraries, application specific software (water temp., odometer, tachometer, speedometer).
  Application Programming Interface.
  SW Platform layer (> 60% of total SW): OSEK RTOS, OSEK COM, KWP 2000, CCP, transport, I/O drivers & handlers (> 20 configurable modules), sys. config., boot loader, controllers library.
  HW layer: Nec78k, HC08, HC12, H8S26, MB90.
  SW platform reuse: > 70% of total SW.]
Architecture trends
[Figure: spectrum from "high performance for narrow application field" to "high performance for wide application field".
  Special purpose processor: dedicated hardware, DSP, stream processor, graphic processor, network processor; with multiple cores: heterogeneous multiprocessor.
  Programmable hardware: FPGA, reconfigurable systems, dynamically reconfigurable processors.
  Tile processor.
  General purpose CPU; with special instructions: configurable processor; with multiple cores: homogeneous chip-multiprocessor.]
Task Specific (configurable) Processors
RWTH AACHEN → Lisatek (CoWare); IMEC → Target Compiler T, ARM OptimoDE; PHILIPS → Siliconhive; TENSILICA, PicoChip…
[Figure (courtesy: Target Compilers T): applications and SysC specs feed a processor model (DP, ISA, µcode). From the model, a retargetable compiler produces machine code (MACD, APAC, SACH Y,1, NEG, LAR AR3,#X, …), an instruction set simulator supports step and break, and an HDL generator produces an HDL model that goes through RTL synthesis to silicon.]
Parallelism at Three Levels in Extensible Instructions
Three forms of instruction-set parallelism:
• Very Long Instruction Word (VLIW): multi-issue instruction, L operations packed in one long instruction
• Single Instruction Multiple Data (SIMD) aka “vectors”: M copies of storage and function
• Fused operations aka “complex operations”: N dependent operations, with register and constant inputs, implemented as single fused operation
Parallelism: L x M x N
Example: 3 x 4 x 3 = 36 ops/cycle
Example: SAD (sum of absolute differences)
Original C Code:
short total = 0;
char *p1, *p2;
for (i = 0; i < m; i++)
  for (j = 0; j < n; j++)
    total += abs(*p1++ - *p2++);
[Figure: operation graph of the loop body. Without vectorization: addi, addi, l8ui, l8ui, sub, abs, add. With vectorization (8 lanes): liu9x8, liu9x8, sub9x8, abs9x8, cvt9_16, add16x8, where sub9x8, abs9x8, and cvt9_16 are combined into one "fusion" operation.]
Sample Software Pipelined Schedule, Vector + Fusion + FLIX Configuration
loop j=1,n/8 by 2:
SLOT 0          SLOT 1          SLOT 2
liu9x8[j];      liu9x8[j];      fusion[j-2]
liu9x8[j+1];    liu9x8[j+1];    fusion[j-1]
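The effect of the 8-wide vector operations can be imitated in plain Python. This is a sketch for illustration only: it keeps one partial sum per lane and reduces them at the end, which is an assumption about how the vector accumulation is organized rather than something the slides specify, and the function names are hypothetical.

```python
def sad_scalar(block1, block2):
    """Sum of absolute differences, one element pair per step."""
    total = 0
    for p1, p2 in zip(block1, block2):
        total += abs(p1 - p2)
    return total


def sad_vector(block1, block2, lanes=8):
    """Same result, processing `lanes` element pairs per step."""
    partial = [0] * lanes  # one accumulator per vector lane
    for start in range(0, len(block1), lanes):
        chunk1 = block1[start:start + lanes]
        chunk2 = block2[start:start + lanes]
        for lane, (p1, p2) in enumerate(zip(chunk1, chunk2)):
            # Subtract, take the absolute value, and accumulate in one step.
            partial[lane] += abs(p1 - p2)
    return sum(partial)        # final reduction across lanes


pixels1 = list(range(16))
pixels2 = [15 - v for v in range(16)]
print(sad_scalar(pixels1, pixels2), sad_vector(pixels1, pixels2))
```

Both functions return the same value; the vector form simply needs one eighth of the loop iterations, which is what the software pipelined schedule exploits.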
Dynamically Reconfigurable Processors
• Reconfigurable systems → Previous lesson
  - Flexible, but dynamic reconfiguration takes tens of milliseconds.
• Dynamically Reconfigurable Processors
  - Improve area efficiency by changing hardware structure.
  - IPs used in various SoCs.
History
• Reconfigurable Co-processor: Garp (1997), CHIMAERA (2000)
• Multicontext reconfigurable devices: WASMII (1992), Time-multiplexing FPGA (1997), PipeRench (1998), DRL (1998)
• Functional-level synthesis
• Various commercial products have been available since 2000
  - IPFlex DAPDNA-2, NEC electronics DRP-1, PACT Xpp, Elixent DFabrix
  - SONY’s VME (Virtual Mobile Engine) is embedded in Network Walkman and PSP
• Recently, many Japanese vendors have started to develop commercial products
  - Fujitsu
  - Hitachi
  - Lucent
  - Sanyo
  - Toshiba (Mep+D-Fabrix)
What is Configurable
Computing?
Spatially-programmed connection of
processing elements
“Hardware” customized to
specifics of problem.
Direct map of problem
specific dataflow, control.
Circuits “adapted” as
problem requirements
change.
Spatial vs. Temporal
Computing
Spatial
Temporal
Processor vs. FPGA Area
Processing Element
• Specialized for media/stream processing
• Coarse grain ⇔ Fine grain: LUT of FPGAs
• Components
  - ALU
  - Shifter+Mask unit
  - Multiplexers
  - Registers
• Operations and interconnection between components are changeable
• No instruction fetch mechanism: a part of large datapath
Reconfigurable HW (DSP fabric)
• Target signal processing and arithmetic intensive applications
• Reconfigurable array of simple DSP core (CNode)
• Low power architecture
  - Hierarchical clock gating
  - Distributed leakage control (fine grain power gating)
• Programmable DMA engine
• Reconfigurable at run time, multi task
Mapping Flow
[Figure: behavioral code (Procedure(In,Out,inout); Constant A,b,c,…; Begin X=a-in[0]; …….. End;) is turned into a DFG, then partitioned and statically scheduled into a coarse grained configuration. The target is a hierarchy of clusters (level 0, level 1) joined by multiplexers (mux level 2), with node ports N0_i/N0_o, N1_i/N1_o, N2_i/N2_o and data in / data out links.]
• ALUs execute a cyclic micro-sequence
• Data exchanges through hierarchical clustered interconnect
• ILP + software pipelining
• Configuration step is sequence loading and interconnect programming
Mapping Flow
• 3D optimization problem (place/route/schedule)
• Traditional scheduling techniques for VLIW or clustered VLIW don't apply
  - Their solutions do not take into account the spatial dimension of the problem
• Traditional P&R techniques used for FPGAs don't apply either, because they don't consider the time dimension
Putting it all together
Year                           2004   2006   2008   2010   2012
Technology Node (nm)             90     65     45     32     22
Loosely coupled Sub-Systems       2      4      6      8     12
General Purpose CPU: from single to multiple
Hardware Accelerator: from hardwired to reconfigurable
• Constant SoC die size
• Slow evolution of peripherals (area decrease)
• GP CPU sub-system complexity 2x at each node (constant area)
• Embedded memory capacity 2x at each node (constant area)
• Loosely coupled DSP sub-system complexity increases by 30% at each node (30% area decrease)
What can fit in 45mm² in 45nm
[Figure: floorplan with two host cores (Core 1, Core 2), each with L1; six DSPs, each with L1 and DMA; a programmable multimedia accelerator with 192 CNodes (40 GOPS); hardwired blocks (HW) with DMA, including video H/W and imaging H/W; an interconnect; L2 as 4MB multi-port embedded memory; peripherals & analog.]