Survey of Digital Signal Processors

Michael Warner
ECD: VLSI Communication Systems
Agenda
• Industry Trends
• DSP Architecture
• DSP Micro-Architecture
• DSP Systems
Moore’s Law Drives Processor Development
[Figure: transistors per die (log scale, 10^0 to 10^10) versus year (1960 to 2010), plotting Moore's 1965 data and Intel microprocessors: 4004, 8008, 8080, 8086, 80286, 386™, 486™, Pentium®, Pentium® II, Pentium® III, Pentium® 4, Itanium®, Itanium® 2.]
Source: Intel internal
Doubling the number of transistors every 18-24 months at the same price point drives significant product opportunities
…especially if you have little regard for power
But what if energy-delay had to be reduced every generation by an order of magnitude?
Gene’s Law Drives DSP Development
[Figure: DSP power in mW/MIPS (log scale, 0.00001 to 1,000) versus year (1982 to 2008), with the Gene’s Law trend line.]
Gene’s Law will have its challenges to hold the line!
What’s Driving Gene’s Law?
Digital Audio
• MP3
• RealAudio
Streaming Video
• MPEG-4
• H.263
Connectivity
• Internet
• Bluetooth
Modem Standards
• UMTS
• GSM
DSP Design Constraints

DEVICE CAPABILITIES    1982      1992      2002
Technology (µm)        3         0.8       0.1
Transistors            50K       500K      180M
MIPS                   5         40        5,000
RAM (bytes)            256       2K        3M
Power (mW/MIPS)        250       12.5      0.1
Price/MIPS             $30.00    $0.38     $0.02
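
The trend in the device-capabilities figures can be checked with a little arithmetic. The sketch below is illustrative only: it takes the 1982 and 2002 values for power per MIPS and transistor count from the slide and computes the halving or doubling period they imply, assuming a constant exponential rate between the two endpoints. The function and constant names are made up for this example.

```python
import math

# Values copied from the DEVICE CAPABILITIES figures on the slide.
YEARS = [1982, 1992, 2002]
POWER_MW_PER_MIPS = [250.0, 12.5, 0.1]
TRANSISTORS = [50e3, 500e3, 180e6]


def factor_of_two_period_years(year0, value0, year1, value1):
    """Years per factor-of-two change, assuming a constant exponential rate."""
    factors_of_two = abs(math.log2(value1 / value0))
    return (year1 - year0) / factors_of_two


# Halving period of power per MIPS between 1982 and 2002.
power_halving = factor_of_two_period_years(
    YEARS[0], POWER_MW_PER_MIPS[0], YEARS[2], POWER_MW_PER_MIPS[2]
)

# Doubling period of transistor count between 1982 and 2002.
transistor_doubling = factor_of_two_period_years(
    YEARS[0], TRANSISTORS[0], YEARS[2], TRANSISTORS[2]
)

print(f"Power per MIPS halves about every {power_halving * 12:.1f} months")
print(f"Transistor count doubles about every {transistor_doubling * 12:.1f} months")
```

With these endpoints, power per MIPS halves roughly every 21 months and transistor count doubles roughly every 20 months, both inside the 18-24 month window quoted for Moore's Law earlier in the deck.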
Agenda
• Industry Trends
• DSP Architecture
• DSP Micro-Architecture
• DSP Systems
What Makes a DSP a DSP?
• Single-Cycle MAC
• Multiple Execution Units
• High Bandwidth (Flat) Memory Sub-Systems
• Efficient Zero-Overhead Looping
• Short Pipeline
• High Bandwidth I/O
• Specialized Instruction Sets
• Sophisticated DMA
• Little to No Speculation
Single-Cycle MAC
• MACs Typically Determine DSP Performance and Pipeline Length (EX)
• Most DSPs Have 2-8 MAC Units
• MACs Typically Operate in Both a Scalar and Vector Mode
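
To make the role of the MAC concrete, here is a minimal sketch of one FIR filter output computed as a chain of multiply-accumulates, first with a single accumulator and then with two accumulators standing in for two MAC units working on alternate taps. This is a software model only, not how any particular DSP schedules its MACs; the function names and the sample data are invented for illustration.

```python
def fir_single_mac(samples, coeffs):
    """One FIR output sample: one multiply-accumulate per tap."""
    acc = 0
    for x, h in zip(samples, coeffs):
        acc += x * h  # the MAC operation
    return acc


def fir_dual_mac(samples, coeffs):
    """Same output, modelling two MAC units that each take every other tap."""
    n = len(coeffs)
    acc0 = 0  # accumulator of the first MAC unit (even taps)
    acc1 = 0  # accumulator of the second MAC unit (odd taps)
    for i in range(0, n - 1, 2):
        acc0 += samples[i] * coeffs[i]
        acc1 += samples[i + 1] * coeffs[i + 1]
    if n % 2:  # odd tap count: one tap is left over
        acc0 += samples[n - 1] * coeffs[n - 1]
    return acc0 + acc1


samples = [1, 2, 3, 4, 5]
coeffs = [5, 4, 3, 2, 1]
print(fir_single_mac(samples, coeffs), fir_dual_mac(samples, coeffs))
```

With two MAC units the loop runs about half as many iterations for the same result, which is the sense in which the number of MAC units sets the performance of filter-style kernels.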
Multiple Instruction Units
• VLIW Architectures Driving ILP
• Typical Instruction Units
  • M-Unit - MAC
  • S-Unit - Shift
  • L-Unit - ALU
  • D-Unit - Load/Store
• Industry Has Converged on an ILP of ~8
[Figure: data path diagram. Register files A (A0 - A15) and B (B0 - B15) feed functional units L1, S1, M1, D1 and D2, M2, S2, L2 through source (S1, S2), destination (D), and long source/destination (SL, DL) ports, with cross paths 1X and 2X and load-data inputs DDATA_I1 and DDATA_I2.]
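
The way an 8-wide machine fills its issue slots can be sketched with a toy packer. The example below assumes two units of each type (M, S, L, D), matching the eight unit names on the data path slide, and assumes that all operations are independent and issue in program order. Real VLIW scheduling also has to respect data dependencies, cross-path limits, and latencies, none of which is modelled here; the operation names are hypothetical.

```python
# Two data paths times four unit types gives eight issue slots per packet.
SLOTS_PER_PACKET = {"M": 2, "S": 2, "L": 2, "D": 2}


def pack(ops):
    """Greedily pack (name, unit_type) ops, in order, into execute packets.

    A new packet starts whenever the next op needs a unit type whose
    slots are already full in the current packet.
    """
    packets = []
    current = []
    used = {}
    for name, unit in ops:
        if used.get(unit, 0) >= SLOTS_PER_PACKET[unit]:
            packets.append(current)
            current = []
            used = {}
        current.append(name)
        used[unit] = used.get(unit, 0) + 1
    if current:
        packets.append(current)
    return packets


ops = [
    ("mpy1", "M"), ("mpy2", "M"), ("add1", "L"),
    ("ld1", "D"), ("mpy3", "M"), ("shl1", "S"),
]
for i, packet in enumerate(pack(ops), start=1):
    print(f"packet {i}: {packet}")
```

Here the third multiply cannot join the first packet because both M slots are taken, so it starts a second packet. Achieved parallelism depends on the mix of unit types in the code as well as on the slot count.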
High Bandwidth Memory Sub-Systems
• Multiple Load-Store Units Required to Feed Data Path
• Tightly Coupled Memory is Typically Dual Ported
• Harvard Architecture is Heavily Banked
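[Figure: memory sub-system block diagram. Labels recoverable from the slide: MUXES, Central Arithmetic Logic Unit (MAC, A, B, ALU, SHIFTER), PC, ARs, CNTL, INTERNAL MEMORY, EXTERNAL MEMORY, and connections labelled P, D, C, E.]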
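
Banking and dual porting both come down to how many accesses a memory can service in the same cycle. The sketch below is a simplified model with invented parameters (four word-interleaved banks, 4-byte words), not the organisation of any specific device: accesses that fall in different banks proceed in parallel, and accesses to the same bank serialise unless the bank has enough ports.

```python
import math
from collections import Counter

NUM_BANKS = 4    # hypothetical bank count
WORD_BYTES = 4   # hypothetical word size used for interleaving


def bank_of(addr):
    """Bank index for a byte address with word-interleaved banks."""
    return (addr // WORD_BYTES) % NUM_BANKS


def cycles_to_service(addresses, ports_per_bank=1):
    """Cycles needed for a group of accesses issued in the same cycle.

    Different banks work in parallel; accesses that collide on one bank
    are spread over ceil(count / ports_per_bank) cycles.
    """
    per_bank = Counter(bank_of(a) for a in addresses)
    if not per_bank:
        return 0
    return max(math.ceil(n / ports_per_bank) for n in per_bank.values())


print(cycles_to_service([0, 4]))                      # different banks
print(cycles_to_service([0, 16]))                     # same bank, single ported
print(cycles_to_service([0, 16], ports_per_bank=2))   # same bank, dual ported
```

In this model, two loads with a stride equal to the bank count times the word size always collide, which is why data placement matters when several load-store units feed the data path.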
Specialized Instruction Sets
• Base RISC ISA Plus CISC ISA Driven by End Application
  • MAC
  • SAD
  • LMS
  • FIRS
  • Viterbi
• Support For Both Scalar and Vector Instructions
• Support For 8, 16 and 32-Bit Instructions
• Instructions are Highly Orthogonal
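
As an illustration of why an application-driven instruction such as SAD earns a place in the ISA, here is a minimal software version of the sum of absolute differences and its typical use in block matching for video coding. On a general-purpose ISA each element costs a subtract, an absolute value, and an add; a dedicated instruction folds these together. The function names and pixel values are invented for this sketch.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-length pixel blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def best_match(reference, candidates):
    """Index of the candidate block with the smallest SAD to the reference."""
    return min(range(len(candidates)), key=lambda i: sad(reference, candidates[i]))


reference = [10, 20, 30, 40]
candidates = [
    [50, 60, 70, 80],
    [12, 18, 30, 45],
    [10, 20, 30, 90],
]
print(sad(reference, candidates[1]))
print(best_match(reference, candidates))
```

A motion search evaluates this kernel for many candidate positions per block, so the inner loop dominates encoder cycles.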
Scalar (55x) vs VLIW (64x)
• Scalar DSPs Tend to be More CISC Like
  • Hurts Compiler Performance
  • Improves Energy-Delay
  • Improves Code Density
  • Limits Top End Performance
• VLIW DSPs Tend to be More RISC Like
• RISC + GP Regs + Orthogonality Makes For a Good C Compiler
• Assembler Code Is Challenging
• RISC ISA Allows for Higher Frequencies
• Load-Store Hurts Energy-Delay
TMS320C54x
TMS320C54x Protected Pipeline

CYCLES →
P1  F1  D1  A1  R1  X1
    P2  F2  D2  A2  R2  X2
        P3  F3  D3  A3  R3  X3
            P4  F4  D4  A4  R4  X4
                P5  F5  D5  A5  R5  X5
                    P6  F6  D6  A6  R6  X6

Fully loaded pipeline: the cycle containing X1 R2 A3 D4 F5 P6

Prefetch: Calculate address of instruction
Fetch: Collect instruction
Decode: Interpret instruction
Access: Collect address of operand
Read: Collect operand
Execute: Perform operation
Note: Protected Pipeline Limits Micro-Architectural Flexibility and Performance
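
The staggered pipeline chart for the six stages listed above (Prefetch, Fetch, Decode, Access, Read, Execute) can be generated mechanically. The sketch below assumes an ideal pipeline with one instruction entering per cycle and no stalls; the single-letter stage labels follow the chart, and the function name is made up for this example.

```python
STAGES = ["P", "F", "D", "A", "R", "X"]  # Prefetch, Fetch, Decode, Access, Read, Execute


def pipeline_chart(num_instructions):
    """Return, for each cycle, the stage labels active in that cycle.

    Instruction k (numbered from 1) enters Prefetch in cycle k and moves
    one stage per cycle; labels are listed oldest instruction first.
    """
    total_cycles = num_instructions + len(STAGES) - 1
    chart = []
    for cycle in range(total_cycles):
        active = []
        for instr in range(num_instructions):
            stage = cycle - instr
            if 0 <= stage < len(STAGES):
                active.append(f"{STAGES[stage]}{instr + 1}")
        chart.append(active)
    return chart


for cycle, active in enumerate(pipeline_chart(6), start=1):
    print(f"cycle {cycle:2d}: {' '.join(active)}")
```

For six instructions the sixth cycle holds X1 R2 A3 D4 F5 P6, which is the fully loaded column of the chart; from that point one instruction completes per cycle.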
TMS320C6xx
[Figure: ’C6xx CPU core block diagram. Program Fetch, Instruction Dispatch, and Instruction Decode feed Data Path 1 (A Register File; units L1, S1, M1, D1) and Data Path 2 (B Register File; units D2, M2, S2, L2). Unit labels: Arithmetic Logic Unit, Auxiliary Logic Unit, Multiplier Unit. Also shown: Control Registers, Control Logic, Test, Emulation, Interrupts.]
TMS320C6xx Exposed Pipeline

Fetch: PG PS PW PR    Decode: DP DC    Execute: E1 E2 E3 E4 E5

• Fetch
  • PG  Program Address Generate
  • PS  Program Address Send
  • PW  Program Access Ready Wait
  • PR  Program Fetch Packet Receive
• Decode
  • DP  Instruction Dispatch
  • DC  Instruction Decode
• Execute
  • E1 - E5  Execute 1 through Execute 5

Execute Packet 1  PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 2     PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 3        PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 4           PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 5              PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 6                 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 7                    PG PS PW PR DP DC E1 E2 E3 E4 E5

Note: Exposed Pipeline Adds Risk to Programming Model
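
The risk an exposed pipeline adds to the programming model can be shown with a toy simulator. In the sketch below one instruction issues per cycle, each result is written a fixed number of delay slots after issue, and the hardware does not interlock, so an instruction that reads a register too early silently gets the old value. The one-delay-slot multiply is an assumption chosen for illustration, and the register names, tuple format, and `run` function are all hypothetical.

```python
def run(program, initial_regs):
    """Simulate an exposed pipeline without interlocks.

    program: list of (dest, func, sources, delay_slots). One instruction
    issues per cycle; its result becomes visible delay_slots + 1 cycles
    after issue. Early readers see the stale register value.
    """
    regs = dict(initial_regs)
    pending = []  # entries are (ready_cycle, dest, value)
    for cycle, (dest, func, sources, delay_slots) in enumerate(program):
        # Commit every write whose delay slots have elapsed.
        for _, d, v in [p for p in pending if p[0] <= cycle]:
            regs[d] = v
        pending = [p for p in pending if p[0] > cycle]
        # Operands are read at issue time, with no hazard check.
        value = func(*(regs[s] for s in sources))
        pending.append((cycle + delay_slots + 1, dest, value))
    # Drain writes still in flight when the program ends.
    for _, d, v in sorted(pending, key=lambda p: p[0]):
        regs[d] = v
    return regs


def mul(a, b):
    return a * b


def add(a, b):
    return a + b


def nop(a):
    return a


start = {"a0": 1, "a1": 2, "a2": 3, "a3": 100, "a4": 0}
MPY = ("a3", mul, ("a1", "a2"), 1)   # assumed: one delay slot
ADD = ("a4", add, ("a3", "a0"), 0)
NOP = ("a0", nop, ("a0",), 0)

print(run([MPY, ADD], start)["a4"])       # ADD reads a3 inside the delay slot
print(run([MPY, NOP, ADD], start)["a4"])  # delay slot filled, ADD sees the product
```

The first program adds the stale value of a3 and the second adds the product, with no error reported in either case. On a protected pipeline the hardware would stall the ADD; here the programmer or compiler has to fill the delay slot.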
Agenda
• Industry Trends
• DSP Architecture
• DSP Micro-Architecture
• DSP Systems
Micro-Architectural Challenges
• Accessing (Flat) On Chip Memory At Speed
  • Within 2-3 cycles
• Feeding Multiple Functional Units From a Single Register File
• Running 600 MHz+ with a 7-9 Stage Pipeline
• Linking Multiple Functional Units with Result Forwarding
• Implementing CISC Data-path to Meet Area and Performance Goals
• Achieving ARM Like Code Density
What Does and Doesn’t Work?
• Do
  • Banked Memory
  • Dual Access Memory
  • Full Custom Register Files
  • Split/Multiple Register Files
  • Custom/Semi-Custom Data-paths
  • Variable Length Instructions
  • CISC ISA
  • Co-Processors
  • Multi-Core
• Don’t
  • Multi-Level Caches
  • Super-Scalar
  • VLIW Packet Descriptors
  • Speculative Branching
  • Full Synthesis
  • Dynamic Logic
• Consider
  • Multi-Threading
  • µP with Co-Processors
Agenda
• Industry Trends
• DSP Architecture
• DSP Micro-Architecture
• DSP Systems
DSP Systems
[Figure: DSP devices by market segment, arranged by performance. Segments: Wireless Infrastructure, Wired Infrastructure, Wireless Client, Digital Still Camera, Audio. Devices: TMS320C6416, TMS320C5561, OMAP5910, TMS320DM310, TMS320DA610. Callouts recoverable from the slide: 600 MHz; Viterbi and Turbo hardware accelerators; 6 DSP CPU @ 300 MHz; 3MB (24Mb) integrated memory; 180M transistors; voice, data, video; DSP+GPP; low power consumption; imaging accelerators; 225 MHz; floating point.]
VoIP Platform
• TNETV3010 Features
• 6 C55x DSP @ 300 MHz
• Shared Instruction Memory
• Broadcast DMA
• 24M Bits of On Chip SRAM
DaVinci Platform
[Figure: DaVinci block diagram, updated on 9/18/2003, marked "Confidential released under NDA". Labels recoverable from the slide:
 Processors: C64x DSP (450MHz) with 80KB/80KB L1 RAM; ARM926EJ-S (225MHz) with I-cache 16KB and D-cache 8KB; RAM 16KB; ROM 8KB.
 Image Processing Block (225MHz): iMX+, Seq, ImgBuf 8KB (several instances).
 Video Processing FE (150MHz): CCD/CMOS Video Interface (fed by a CCD/CMOS module or NTSC/PAL decoder), Preview Engine, 3A Calc.
 Video BE (27MHz): OSD, Video Encoder, 10b DAC (x3); Composite & Simul Comp 480p; 24b RGB/YUV.
 Memory and DMA: DDR2 EMIF (133MHz), DDR266 16/32, XDMA, eDMA TC (150MHz).
 Buses: TC Bus 64bit (150 MHz), CFG bus 32bit, Periph Bus (VBUSP, 75MHz), Config Bus (VBUSP).
 Peripherals: Nand Flash / Smart Media, MS / MS Pro, MMC / SD, USB2.0 OTG/PHY (MUSB2.0), SPI I/F (x2), UART (x4), I2C, McBSP Audio I/F, GIO, Timer/WDT (x6), PWM (x3), VLYNQ (10-pin), ENET MAC, CFC/ATA, 1394.
 Clocks and debug: CLOCK ctrl, PLL(s), 27MHz, 24MHz (optional), JTAG I/F.]
OMAP Platform
• OMAP2420 Features
• ARM 1136 @ 330 MHz, VFP (Vector Floating Point), 32K/32K I/D cache
• DSP @ 220 MHz
• 2D/3D graphics accelerator
• IVA supports still images to >4 Mpixels, 30 fps VGA video decode
• Output to TV for gaming and video playback
• Encryption hardware for DRM and security
[Figure: OMAP2420 block diagram. Blocks: ARM11 + VFP, TMS320C55x DSP, 2D/3D Graphics Accelerator, Imaging & Video Accelerator (IVA), Internal SRAM, Peripherals, Memory Controller, Security, Camera I/F, LCD I/F, Video Out, L3 Interconnect, L4 Interconnect.]