Computing Engine Choices

• General Purpose Processors (GPPs): Intended for general purpose computing
(desktops, servers, clusters, ...). General Purpose ISAs (RISC or CISC).
• Application-Specific Processors (ASPs): Processors with ISAs and
architectural features tailored towards specific application domains.
Special Purpose ISAs.
– E.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors,
Graphics Processing Units (GPUs), Vector Processors??? ...
• Co-Processors: A hardware (hardwired) implementation of specific
algorithms with limited programming interface (augment GPPs or ASPs).
• Configurable Hardware:
– Field Programmable Gate Arrays (FPGAs)
– Configurable array of simple processing elements
• Application Specific Integrated Circuits (ASICs): A custom VLSI hardware
solution for a specific computational task.
Note: The ISA forms an abstraction layer that sets the requirements for both
compiler and CPU designers.
The choice of one or more depends on a number of factors including:
- Type and complexity of computational algorithm
(general purpose vs. Specialized)
- Desired level of flexibility and programmability
- Performance requirements
- Desired level of computational efficiency
(e.g Computations per watt or computations per chip area)
- Power requirements
- Development time and cost
- Real-time constraints
- System cost
EECC722 - Shaaban
Repeated here from lecture 1
#1 lec # 8
Fall 2011 10-12-2011
Computing Engine Choices
[Figure: spectrum of computing engines, from the software end to the hardware end: General Purpose Processors (GPPs), Application-Specific Processors (ASPs), Configurable Hardware, Co-Processors, Application Specific Integrated Circuits (ASICs). One axis is labeled Programmability / Flexibility; the other is labeled Specialization, Development cost/time, Performance/Chip Area/Watt (Computational Efficiency).]
Processor = Programmable computing element that runs
programs written using a pre-defined set of instructions.
For Application-Specific Processors (ASPs):
Application Domain Requirements → ASP ISA → ASP Architecture
(e.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors,
Graphics Processing Units (GPUs), Physics Processors ...)
Selection factors: the same factors listed on the previous slide.
Why Application-Specific Processors (ASPs)?
Computing Element Choices Observation
• Generality and efficiency (i.e. computational efficiency) are in some sense
inversely related to one another:
– The more general-purpose a computing element is, and thus the greater the
number of tasks it can perform, the less efficient (e.g. computations per
chip area/watt) it will be in performing any of those specific tasks.
– Design decisions are therefore almost always compromises; designers
identify key features or requirements of applications that must be met
and make compromises on other less important features.
• To counter the problem of computationally intense and
specialized problems for which general purpose
processors/machines cannot achieve the necessary
performance/other requirements:
– Special-purpose processors (or Application-Specific Processors, ASPs),
attached processors, and coprocessors have been designed/built for many
years, for specific application domains, such as image or digital signal
processing (for which many of the computational tasks are specialized and
can be very well defined).
Generality = Flexibility = Programmability ?
Efficiency = Computational Efficiency
(Computations per watt or chip area)
Digital Signal Processor (DSP) Architecture
• Classification of Main Processor Types/Applications
• Requirements of Embedded Processors (DSPs are often embedded)
• DSP vs. General Purpose CPUs
• DSP Cores vs. Chips
• Classification of DSP Applications
• DSP Algorithm Format
• DSP Benchmarks
• Basic Architectural Features of DSPs
• DSP Software Development Considerations
• Classification of Current DSP Architectures and example DSPs (by DSP generation):
– Generations 1-2: Conventional DSPs: TI TMS320C54xx
– Generation 3: Enhanced Conventional DSPs: TI TMS320C55xx
– Generation 4: Multiple-Issue DSPs:
• VLIW DSPs: TI TMS320C62xx, TMS320C64xx
• Superscalar DSPs: LSI Logic ZSP400/500 DSP core
Main Processor Types/Applications
• General Purpose Computing & General Purpose Processors (GPPs) (64 bit)
– High performance: In general, faster is always better.
– RISC or CISC: Intel P4, IBM Power4, SPARC, PowerPC, MIPS ...
– Used for general purpose software
– End-user programmable
– Real-time performance may not be fully predictable (due to dynamic arch. features)
– Heavy weight, multi-tasking OS - Windows, UNIX
– Normally, low cost and power not a requirement (changing)
– Servers, Workstations, Desktops (PCs), Notebooks, Clusters ...
• Embedded Processing: Embedded processors and processor cores (16-32 bit)
– Cost, power, code-size and real-time requirements and constraints
– Once real-time constraints are met, a faster processor may not be better
– e.g. Intel XScale, ARM, 486SX, Hitachi SH7000, NEC V800 ...
– Often require digital signal processing (DSP) support or other
application-specific support (e.g. network, media processing)
– Single or few specialized programs, known at system design time
– Not end-user programmable
– Real-time performance must be fully predictable (avoid dynamic arch. features)
– Lightweight, often real-time OS or no OS
– Examples: Cellular phones, consumer electronics ...
• Microcontrollers (8 bit)
– Extremely code size/cost/power sensitive
– Single program
– Small word size - 8 bit common
– Usually no OS
– Highest volume processors by far
– Examples: Control systems, automobiles, industrial control, thermostats, ...
Cost/complexity increases toward GPPs; volume increases toward microcontrollers.
Embedded processors and microcontrollers are marked as examples of Application-Specific Processors (ASPs).
The Processor Design Space (Main Types)
[Figure: performance plotted against processor cost (chip area, power, complexity). Regions shown: Microcontrollers (cost is everything); Embedded processors (real-time constraints, specialized applications, low power/cost constraints); Application specific architectures for performance; Microprocessors / GPPs (performance is everything & software rules). A label marks examples of ASPs.]
Requirements of Embedded Processors
(Embedded processors: how fast?)
• Usually must meet strict real-time constraints:
– Real-time performance must be fully predictable:
• Avoid dynamic processor architectural features that make real-time
performance harder to predict (e.g. cache, dynamic scheduling, hardware
speculation ...)
– Once real-time constraints are met, a faster processor is not desirable
(overkill) due to increased cost/power requirements.
• Optimized for a single (or few) program(s) - code often in on-chip ROM or
on/off chip EPROM/flash memory.
• Minimum code size (one of the motivations initially for Java)
• Performance obtained by optimizing datapath
• Low cost
– Lowest possible area
• High computational efficiency: Computation per unit area
– VLSI implementation technology usually behind the leading edge (good or bad?)
– High level of integration of peripherals (System-on-Chip, SoC, approach reduces
system cost/power)
• Fast time to market
– Compatible architectures (e.g. ARM family) allow reusable code
– Customizable cores (System-on-Chip, SoC).
• Low power if application requires portability
Embedded Processors
Area of processor cores = Cost
(and Power requirements)
Thus need to minimize chip area
[Figure labels: Embedded version of a GPP, Nintendo processor, Cellular phones]
Embedded Processors
Another figure of merit: Computation per unit chip area
(Computational Efficiency)
[Figure labels: Embedded version of a GPP, Nintendo processor, Cellular phones]
Embedded Processors
Code size: smaller is better
• If a majority of the chip is the program stored in ROM, then minimizing code size is a critical
issue
• Common embedded processor ISA features to minimize code size (CISC-like?):
1 – Variable length instruction encoding common:
• e.g. the Piranha has 3 instruction sizes: basic 2 byte, and 2 byte plus 16 or 32 bit
immediate
2 – Complex/specialized instructions
3 – Complex addressing modes
Embedded Systems vs. General Purpose Computing
Embedded Systems (and embedded processors):
• Run a single or few specialized applications, often known at system design time
• May require application-specific capability (e.g. DSP)
• Not end-user programmable
• Minimum code size is highly desirable
• Lightweight, often real-time OS or no OS
• Low power and cost constraints/requirements
• Usually must meet strict real-time constraints (e.g. real-time sampling rate). Thus:
– Real-time performance must be fully predictable: avoid dynamic processor
architectural features that make real-time performance harder to predict
– Once real-time constraints are met, a faster processor is not desirable
(overkill) due to increased cost/power requirements.
General Purpose Computing Systems (and processors, GPPs):
• Used for general purpose software: intended to run a fully general set of
applications that may not be known at design time
• No application-specific capability required
• End-user programmable
• Minimizing code size is not an issue
• Heavy weight, multi-tasking OS - Windows, UNIX
• Higher power and cost constraints/requirements
• In general, no real-time constraints. Thus:
– Real-time performance may not be fully predictable (due to dynamic processor
architectural features): superscalar: dynamic scheduling, hardware
speculation, branch prediction, cache.
– Faster (higher-performance) is usually better
Evolution of GPPs and DSPs
• General Purpose Processors (GPPs) trace roots back to Eckert, Mauchly, Von
Neumann (ENIAC) + EDSAC (i.e. first generation processors)
• Digital Signal Processors (DSPs) are microprocessors designed for efficient
mathematical manipulation of digital signals utilizing digital signal processing
algorithms.
– DSPs usually process infinite continuous sampled data streams (physical
signals) while meeting real-time and power constraints.
– DSPs evolved from Analog Signal Processors (ASPs) that utilize analog
hardware to transform physical signals (classical electrical engineering)
– ASP to DSP because:
• DSP insensitive to environment (e.g., same response in snow or desert if it
works at all)
• DSP performance identical even with variations in components; the behavior of
two analog systems varies even if built with the same components with 1% variation
• Different history and different application requirements led to different ISA
design considerations, terms, metrics, and architectures, and some new
inventions.
For Application-Specific Processors (ASPs):
Application Domain Requirements → ASP ISA → ASP Architecture
DSP vs. General Purpose CPUs
• DSPs tend to run one (or few) program(s), not many programs.
– Hence OSes (if any) are much simpler, there is no virtual memory or protection,
...
• DSPs usually run applications with hard real-time constraints (similar to other embedded processors):
– DSP must meet application signal sampling rate computational requirements
(DSP performance requirements):
• Once above real-time constraints are met, a faster DSP is overkill (higher
DSP cost, power ...) without additional benefit.
– You must account for anything that could happen in a time slot (DSP algorithm
inner-loop, data sampling rate)
– All possible interrupts or exceptions must be accounted for and their collective
time be subtracted from the time interval.
• Therefore, exceptions are BAD.
• DSPs usually process infinite continuous data streams:
– Requires high memory bandwidth (with predictable latency, e.g. no data
cache) for streaming real-time data samples and predictable processing
time on the data samples
• The design of DSP ISAs and processor architectures is driven by the
requirements of DSP algorithms.
– Thus DSPs are application-specific processors
DSP Algorithms → DSP ISAs → DSP Architectures
DSP vs. GPP
• The “MIPS/MFLOPS” of DSPs is speed of Multiply-Accumulate
(MAC), i.e. the main performance measure of DSPs is MAC speed. Why?
– MAC is common in DSP algorithms that involve computing a vector dot
product, such as digital filters, correlation, and Fourier transforms.
– DSPs are judged by whether they can keep the multipliers busy 100% of the
time and by how many MACs are performed in each cycle.
• The "SPEC" of DSPs is 4 algorithms:
– Infinite Impulse Response (IIR) filters
– Finite Impulse Response (FIR) filters
– FFT, and
– convolvers
• In DSPs, target algorithms are important (since DSPs are application domain specific processors):
– Binary compatibility not a major issue (unlike general purpose)
• High-level software is not as important in DSPs as in GPPs.
– People still write in assembly language for a product to minimize
the die area for ROM in the DSP chip (code size).
Note: While this is still mostly true, programming for DSPs in high
level languages (HLLs) has been gaining more acceptance due to the
development of more efficient HLL DSP compilers in recent years.
Types of DSP Processors
According to type of Arithmetic/operand Size Supported
• 32-BIT FLOATING POINT (5% of DSP market). Examples:
– TI TMS320C3X, TMS320C67xx (VLIW)
– AT&T DSP32C
– ANALOG DEVICES ADSP21xxx
– Hitachi SH-4
• 16-BIT FIXED POINT (95% of DSP market). Examples:
– TI TMS320C2X, TMS320C62xx (VLIW)
– Infineon TC1xxx (TriCore1) (VLIW)
– MOTOROLA DSP568xx, MSC810x (VLIW)
– ANALOG DEVICES ADSP21xx
– Agere Systems DSP16xxx, Starpro2000
– LSI Logic LSI140x (ZSP400) superscalar
– Hitachi SH3-DSP
– StarCore SC110, SC140 (VLIW)
DSP Cores vs. Chips
DSPs are usually available as synthesizable cores or off-the-shelf packaged chips
• Synthesizable Cores (IP):
– Map into chosen fabrication process
• Speed, power, and size vary
– Choice of peripherals, etc. (SoC = System On Chip)
– Requires extensive hardware development effort,
resulting in more development time and cost (very high volume needed to justify development cost)
• Off-the-shelf packaged chips:
– Highly optimized for speed, energy efficiency, and/or cost.
– Lower development time/cost/effort.
– Tools, 3rd-party support often more mature.
– Faster time to market.
– Limited performance, integration options.
DSP ARCHITECTURE
Enabling Technologies
Time Frame | Approach | Primary Application | Enabling Technologies
Early 1970’s | Discrete logic | Non-real time processing; Simulation | Bipolar SSI, MSI; FFT algorithm
Late 1970’s | Building block | Military radars; Digital Comm. | Single chip bipolar multiplier; Flash A/D
Early 1980’s (generation 1) | Single Chip DSP µP | Telecom; Control | µP architectures; NMOS/CMOS
Late 1980’s (generation 2) | Function/Application specific chips | Computers; Communication | Vector processing; Parallel processing
Early 1990’s (generation 3) | Multiprocessing | Video/Image Processing | Advanced multiprocessing; VLIW, MIMD, etc.
Late 1990’s (generation 4) | Single-chip multiprocessing | Wireless telephony; Internet related | Low power single-chip DSP; VLIW/Multiprocessing
First microprocessor DSP: TI TMS 32010
Generations 1 to 4 are the generations of single-chip (microprocessor) DSPs
Texas Instruments TMS320 Family
Multiple DSP µP Generations
Device | First Sample | Bit Size | Clock speed (MHz) | Instruction Throughput | MAC execution (ns) | MOPS | Device density (# of transistors)
Uniprocessor based (Harvard Architecture):
TMS32010 | 1982 | 16 integer | 20 | 5 MIPS | 400 | 5 | 58,000 (3µ)
TMS320C25 | 1985 | 16 integer | 40 | 10 MIPS | 100 | 20 | 160,000 (2µ)
TMS320C30 | 1988 | 32 flt.pt. | 33 | 17 MIPS | 60 | 33 | 695,000 (1µ)
TMS320C50 | 1991 | 16 integer | 57 | 29 MIPS | 35 | 60 | 1,000,000 (0.5µ)
TMS320C2XXX | 1995 | 16 integer | - | 40 MIPS | 25 | 80 | -
Multiprocessor (VLIW) based:
TMS320C80 | 1996 | 32 integer/flt. | - | - | - | 2 GOPS, 120 MFLOP | MIMD
TMS320C62XX | 1997 | 16 integer | - | 1600 MIPS | 5 | 20 GOPS | VLIW
TMS320C67XX | 1997 | 32 flt. pt. | - | - | 5 | 1 GFLOP | VLIW
The table spans generations 1 to 4 of single-chip (microprocessor) DSPs.
DSP Applications
• Digital audio applications
– MPEG Audio
– Portable audio
• Digital cameras
• Cellular telephones
• Wearable medical appliances
• Storage products:
– disk drive servo control
• Military applications:
– radar
– sonar
• Industrial control
• Seismic exploration
• Networking (telecom infrastructure):
– Wireless
– Base station
– Cable modems
– ADSL
– VDSL
– ...
Current DSP Killer Applications: Cell phones and telecom infrastructure
HDTV? ... Other?
DSP Algorithms & Applications
DSP Algorithm: System Application
• Speech Coding: Digital cellular telephones, personal communications systems, digital cordless telephones, multimedia computers, secure communications.
• Speech Encryption: Digital cellular telephones, personal communications systems, digital cordless telephones, secure communications.
• Speech Recognition: Advanced user interfaces, multimedia workstations, robotics, automotive applications, cellular telephones, personal communications systems.
• Speech Synthesis: Advanced user interfaces, robotics
• Speaker Identification: Security, multimedia workstations, advanced user interfaces
• High-fidelity Audio: Consumer audio, consumer video, digital audio broadcast, professional audio, multimedia computers
• Modems: Digital cellular telephones, personal communications systems, digital cordless telephones, digital audio broadcast, digital signaling on cable TV, multimedia computers, wireless computing, navigation, data/fax
• Noise cancellation: Professional audio, advanced vehicular audio, industrial applications
• Audio Equalization: Consumer audio, professional audio, advanced vehicular audio, music
• Ambient Acoustics Emulation: Consumer audio, professional audio, advanced vehicular audio, music
• Audio Mixing/Editing: Professional audio, music, multimedia computers
• Sound Synthesis: Professional audio, music, multimedia computers, advanced user interfaces
• Vision: Security, multimedia computers, advanced user interfaces, instrumentation, robotics, navigation
• Image Compression: Digital photography, digital video, multimedia computers, videoconferencing
• Image Compositing: Multimedia computers, consumer video, advanced user interfaces, navigation
• Beamforming: Navigation, medical imaging, radar/sonar, signals intelligence
• Echo cancellation: Speakerphones, hands-free cellular telephones
• Spectral Estimation: Signals intelligence, radar/sonar, professional audio, music
Another Look at DSP Applications
(Cost increases toward the high end; volume increases toward the low end.)
• High-end:
– Military applications (e.g. radar/sonar)
– Wireless Base Station - TMS320C6000
– Cable modem
– Gateways - HDTV ...
• Mid-range:
– Industrial control
– Cellular phone - TMS320C540
– Fax/voice server ...
• Low end:
– Storage products - TMS320C27 (hard drive controllers)
– Digital camera - TMS320C5000
– Portable phones
– Wireless headsets
– Consumer audio
– Automobiles, thermostats, ...
DSP range of applications
& Possible Target DSPs
Cellular Phone System (Example DSP Application)
[Figure: block diagram with keypad and display (415-555-1212), CONTROLLER, SPEECH ENCODE, SPEECH DECODE, A/D, DAC, PHYSICAL LAYER PROCESSING, BASEBAND CONVERTER, and RF MODEM blocks.]
Cellular Phone: HW/SW/IC Partitioning (Example DSP Application)
[Figure: the cellular phone block diagram (CONTROLLER, SPEECH ENCODE, SPEECH DECODE, A/D, DAC, PHYSICAL LAYER PROCESSING, BASEBAND CONVERTER, RF MODEM) partitioned among a MICROCONTROLLER, an ASIC, a DSP, and an ANALOG IC.]
Mapping Onto System-on-Chip (SoC)
(Cellular Phone)
[Figure: cellular phone functions mapped onto an SoC containing a µC (micro-controller or embedded processor), a DSP core, ASIC logic, RAM, DMA, and S/P blocks. Functions labeled: phone book, keypad intfc, control protocol, voice recognition, speech quality enhancement, RPE-LTP speech decoder, de-intl & decoder, demodulator and synchronizer, Viterbi equalizer.]
Example DSP Application
Example Cellular Phone Organization (Example DSP Application)
[Figure labels: C540 (DSP), ARM7 (µC)]
Multimedia System-on-Chip (SoC)
e.g. Multimedia terminal electronics
[Figure: SoC blocks: µP, custom DSP, Video Unit (ASIC), Memory, Coms; one annotation reads "ASIC, Co-processor or ASP". I/O labels: Graphics Out, Video I/O, Voice I/O, Pen In, Downlink Radio, Uplink Radio.]
• Future chips will be a mix of
processors, memory and
dedicated hardware for
specific algorithms and I/O
Example DSP Application
DSP Algorithm Format
• DSP culture has a graphical format to represent
formulas (i.e. DSP algorithms).
• Like a flowchart for formulas, inner loops,
not programs.
• Some seem natural:
Σ is add, X is multiply
• Others are obtuse:
z^-1 means take variable from earlier iteration (delay).
• These graphs are trivial to decode
DSP Algorithm Notation
• Uses “flowchart” notation instead of equations
• Multiply is drawn as X (or an equivalent graphical symbol)
• Add is drawn as + or Σ
• Delay/Storage is drawn as "Delay", z^-1, or D
Typical DSP Algorithm:
Finite-Impulse Response (FIR) Filter
• Filters reduce signal noise and enhance image or signal
quality by removing unwanted frequencies.
• Finite Impulse Response (FIR) filters compute:
$$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k) = h(n) * x(n)$$
(a vector dot product of the N filter coefficients h(k) with N signal samples x(i-k), computed with Multiply Accumulate (MAC) operations)
where
– x is the input sequence
– y is the output sequence
– h is the impulse response (filter coefficients)
– N is the number of taps (coefficients) in the filter
• Output sequence depends only on input sequence and
impulse response (i.e. filter coefficients).
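The FIR sum can be sketched directly in Python. This is a minimal illustration, not code from the lecture: the function name `fir_filter` is hypothetical, and samples before the start of the input are assumed to be zero.

```python
def fir_filter(h, x):
    """Direct-form FIR: y(i) = sum over k = 0..N-1 of h(k) * x(i - k)."""
    n_taps = len(h)
    y = []
    for i in range(len(x)):
        acc = 0.0                        # accumulator for the MAC operations
        for k in range(n_taps):
            if i - k >= 0:               # assumption: x is zero before index 0
                acc += h[k] * x[i - k]   # one multiply-accumulate per tap
        y.append(acc)
    return y


# 4-tap moving average applied to a unit step
print(fir_filter([0.25, 0.25, 0.25, 0.25], [1.0] * 6))
```

Each output sample costs N multiplies and N adds, which is why MAC throughput dominates FIR performance.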
Typical DSP Algorithms:
Finite-impulse Response (FIR) Filter
• N most recent samples in the delay line (Xi)
• New sample moves data down delay line
• Filter “Tap” is a multiply-add (Multiply And Accumulate, MAC)
• Each tap (N taps total) nominally requires:
– Two data fetches (requires real-time data sample streaming):
• Predictable data bandwidth/latency
• Special addressing modes
• Separate memory banks/busses?
– Multiply
– Accumulate
(multiply and accumulate are repetitive computations:
• Requires efficient MAC support)
– Memory write-back to update delay line
• Special addressing modes (e.g. modulo)
Performance Goal: At least 1 FIR Tap / DSP instruction cycle
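The delay-line update is where modulo (circular) addressing helps: rather than shifting all N samples on every new input, a pointer wraps around a fixed buffer. The sketch below is an illustration in Python under that assumption; the class name `CircularFir` and its layout are hypothetical, and a real DSP would do the wrap-around in its address generation hardware rather than with the `%` operator.

```python
class CircularFir:
    """FIR filter whose delay line is a circular buffer."""

    def __init__(self, h):
        self.h = list(h)
        self.n = len(self.h)
        self.delay = [0.0] * self.n   # holds the N most recent samples
        self.head = 0                 # index of the newest sample

    def step(self, sample):
        # Overwrite the oldest sample instead of moving every element.
        self.head = (self.head - 1) % self.n
        self.delay[self.head] = sample
        acc = 0.0
        for k in range(self.n):
            # Modulo addressing: delay[(head + k) % n] holds x(i - k).
            acc += self.h[k] * self.delay[(self.head + k) % self.n]
        return acc


f = CircularFir([1.0, 2.0, 3.0])
print([f.step(s) for s in [1.0, 0.0, 0.0, 0.0]])  # impulse response of the filter
```

Feeding a unit impulse returns the coefficients in order, which is a quick check that the wrap-around indexing is right.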
FINITE-IMPULSE RESPONSE (FIR) FILTER
[Figure: FIR filter signal-flow graph. Signal samples X enter a delay line of z^-1 elements; the delayed samples are multiplied by the filter coefficients h0, h1, ..., hN-2, hN-1 and summed into the output Y. One multiply-accumulate (MAC) forms one FIR filter tap; a delay labeled "accumulator register" is also shown.]
$$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k)$$
i.e. vector dot product of the filter coefficients and the delayed samples
Performance Goal: at least 1 FIR Tap / DSP instruction cycle
DSP must meet application signal sampling rate computational requirements:
A faster DSP is overkill (more cost/power than really needed)
Sample Computational Rates
for FIR Filtering (DSP Performance Requirements)
Signal type | Frequency | # taps | Performance
1-D Speech | 8 kHz | N = 128 | 20 MOPs
1-D Music | 48 kHz | N = 256 | 24 MOPs
2-D Video phone | 6.75 MHz | N*N = 81 | 1,090 MOPs
2-D TV | 27 MHz | N*N = 81 | 4,370 MOPs (4.37 GOPs)
2-D HDTV | 144 MHz | N*N = 81 | 23,300 MOPs (23.3 GOPs)
OPs = Operations Per Second
A 1-D FIR has nop = 2N and a 2-D FIR has nop = 2N^2 operations per output sample.
DSP must meet application signal sampling rate computational requirements:
• A faster DSP is overkill (higher DSP cost, power ...)
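The required rate follows from multiplying the sampling frequency by the operations per output sample (2N for a 1-D FIR, 2N^2 for a 2-D FIR). The helper below is a hypothetical sketch of that arithmetic; the function name and signature are not from the lecture.

```python
def fir_ops_per_second(sample_rate_hz, taps, two_d=False):
    """Operations per second, using nop = 2N (1-D) or 2N^2 (2-D) per output sample."""
    ops_per_sample = 2 * taps * taps if two_d else 2 * taps
    return sample_rate_hz * ops_per_sample


print(fir_ops_per_second(48_000, 256) / 1e6)                  # 1-D music, in MOPs
print(fir_ops_per_second(6_750_000, 9, two_d=True) / 1e6)     # 2-D video phone, in MOPs
print(fir_ops_per_second(27_000_000, 9, two_d=True) / 1e6)    # 2-D TV, in MOPs
```

Under this formula the music, video phone, TV, and HDTV rows match the table to the stated precision. The speech row (8 kHz, N = 128) works out to about 2 MOPs, so the 20 MOPs listed there may rest on an assumption that is not stated on the slide.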
FIR filter on (simple)
General Purpose Processor
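loop:
lw x0, 0(r0)     ; load the value at address r0 into x0
lw y0, 0(r1)     ; load the value at address r1 into y0
mul a, x0,y0     ; a = x0 * y0
add y0,a,b       ; y0 = a + b
sw y0,(r2)       ; store y0 to the address in r2
inc r0           ; advance the three pointers
inc r1
inc r2
dec ctr          ; decrement the loop counter
tst ctr          ; test the counter
jnz loop         ; repeat while the counter is non-zero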
• Problems:
– Bus / memory bandwidth bottleneck
– Control/loop code overhead
– No suitable addressing modes or instructions, e.g. a multiply and accumulate (MAC) instruction
– GPP real-time performance (to meet signal sampling rate) may not be fully
predictable (due to dynamic processor architectural features):
• Superscalar: dynamic scheduling, hardware speculation,
branch prediction, cache.
Typical DSP Algorithms:
Infinite-Impulse Response (IIR) Filter
• Infinite Impulse Response (IIR) filters compute:
$$y(i) = \sum_{k=1}^{M-1} a(k)\, y(i-k) + \sum_{k=0}^{N-1} b(k)\, x(i-k)$$
(each of the two sums is a sequence of MAC operations)
• Output sequence depends on input sequence, previous
outputs, and impulse response (i.e. filter coefficients a(k), b(k)).
• Both FIR and IIR filters
– Require vector dot product (multiply-accumulate, MAC)
operations
– Use fixed coefficients normally
• Adaptive filters update their coefficients to minimize
the distance between the filter output and the desired
signal.
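A direct evaluation of the IIR recurrence can be sketched as follows. The function name `iir_filter` is hypothetical, `a[0]` is an unused placeholder so that list indices line up with k = 1..M-1, and inputs and outputs before index 0 are assumed to be zero.

```python
def iir_filter(a, b, x):
    """y(i) = sum_{k=1}^{M-1} a(k) y(i-k) + sum_{k=0}^{N-1} b(k) x(i-k)."""
    y = []
    for i in range(len(x)):
        acc = 0.0
        for k in range(1, len(a)):       # feedback MACs on previous outputs
            if i - k >= 0:
                acc += a[k] * y[i - k]
        for k in range(len(b)):          # feed-forward MACs on input samples
            if i - k >= 0:
                acc += b[k] * x[i - k]
        y.append(acc)
    return y


# One feedback tap: the impulse response decays geometrically and never ends.
print(iir_filter([0.0, 0.5], [1.0], [1.0, 0.0, 0.0, 0.0]))
```

The feedback term is what makes the impulse response infinite: a single input impulse keeps producing non-zero outputs.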
Typical DSP Algorithms:
Discrete Fourier Transform (DFT)
• The Discrete Fourier Transform (DFT) allows for
spectral analysis in the frequency domain.
• It is computed as (MAC operations):
$$y(k) = \sum_{n=0}^{N-1} W_N^{nk}\, x(n), \quad k = 0, 1, \ldots, N-1, \quad \text{where } W_N = e^{-2\pi j / N},\; j = \sqrt{-1}$$
– x is the input sequence in the time domain
– y is an output sequence in the frequency domain
• The Inverse Discrete Fourier Transform is computed as (MAC operations):
$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} W_N^{-nk}\, y(k), \quad n = 0, 1, \ldots, N-1$$
• The Fast Fourier Transform (FFT) provides an
efficient method for computing the DFT.
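The DFT definition can be evaluated directly as N dot products of length N. The sketch below assumes the common convention W_N = e^(-2πj/N) for the forward transform and a 1/N factor in the inverse; the function names `dft` and `idft` are illustrative, and this O(N^2) form is not the FFT.

```python
import cmath


def dft(x):
    """y(k) = sum over n of W_N^(n*k) * x(n), with W_N = exp(-2*pi*j/N)."""
    n_points = len(x)
    w = cmath.exp(-2j * cmath.pi / n_points)
    return [sum(w ** (n * k) * x[n] for n in range(n_points))
            for k in range(n_points)]


def idft(y):
    """x(n) = (1/N) * sum over k of W_N^(-n*k) * y(k)."""
    n_points = len(y)
    w = cmath.exp(-2j * cmath.pi / n_points)
    return [sum(w ** (-n * k) * y[k] for k in range(n_points)) / n_points
            for n in range(n_points)]


spectrum = dft([1.0, 2.0, 3.0, 4.0])
print(spectrum[0])      # the k = 0 bin is the sum of the samples
print(idft(spectrum))   # recovers the input up to rounding error
```

Every output bin is a length-N dot product, so the direct DFT needs on the order of N^2 complex MACs.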
Typical DSP Algorithms:
Discrete Cosine Transform (DCT)
• The Discrete Cosine Transform (DCT) is frequently used
in image & video compression (e.g. JPEG, MPEG-2).
• The DCT and Inverse DCT (IDCT) are computed as (MAC operations):
$$y(k) = e(k) \sum_{n=0}^{N-1} \cos\!\left[\frac{(2n+1)\pi k}{2N}\right] x(n), \quad k = 0, 1, \ldots, N-1$$
$$x(n) = \frac{2}{N} \sum_{k=0}^{N-1} e(k) \cos\!\left[\frac{(2n+1)\pi k}{2N}\right] y(k), \quad n = 0, 1, \ldots, N-1$$
where e(k) = 1/sqrt(2) if k = 0; otherwise e(k) = 1.
• An N-point 1-D DCT requires N^2 MAC operations.
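The DCT pair can be checked numerically with a direct evaluation. The sketch assumes the forward transform has no 2/N factor and the inverse carries it, with e(0) = 1/sqrt(2); the helper names `_scale`, `_basis`, `dct`, and `idct` are hypothetical.

```python
import math


def _scale(k):
    """e(k): 1/sqrt(2) for k = 0, otherwise 1."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def _basis(n, k, n_points):
    """cos[(2n + 1) * pi * k / (2N)]"""
    return math.cos((2 * n + 1) * math.pi * k / (2 * n_points))


def dct(x):
    n_points = len(x)
    return [_scale(k) * sum(_basis(n, k, n_points) * x[n] for n in range(n_points))
            for k in range(n_points)]


def idct(y):
    n_points = len(y)
    return [(2 / n_points) * sum(_scale(k) * _basis(n, k, n_points) * y[k]
                                 for k in range(n_points))
            for n in range(n_points)]


samples = [1.0, 2.0, 3.0, 4.0]
print(idct(dct(samples)))   # recovers the input up to rounding error
```

Each of the N outputs is a length-N sum of products, which gives the N^2 MAC count for an N-point 1-D DCT.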
DSP BENCHMARKS
• DSPstone: University of Aachen, application benchmarks
– ADPCM TRANSCODER - CCITT G.721, REAL_UPDATE, COMPLEX_UPDATES
– DOT_PRODUCT, MATRIX_1X3, CONVOLUTION
– FIR, FIR2DIM, HR_ONE_BIQUAD
– LMS, FFT_INPUT_SCALED
• BDTImark2000: Berkeley Design Technology Inc (BDTI)
– 12 DSP kernels in hand-optimized assembly language:
• FIR, IIR, Vector dot product, Vector add, Vector maximum, FFT ...
– Returns single number (higher means faster) per processor
– Uses only on-chip memory (memory bandwidth is the major bottleneck in
performance of embedded applications).
• EEMBC (pronounced “embassy”): EDN Embedded
Microprocessor Benchmark Consortium
– 30 companies formed by Electronic Data News (EDN)
– Benchmark evaluates compiled C code on a variety of embedded processors
(microcontrollers, DSPs, etc.)
– Application domains: automotive-industrial, consumer, office automation,
networking and telecommunications
[Figure: performance of DSP generations 1 to 4 (1st, 2nd, 3rd, and 4th generation); the 4th generation is marked as > 800x faster than the first generation.]
DSPs from generations 2, 3 and 4 are in use today. Why?
Basic DSP ISA/Architectural Features
Specialized DSP Algorithms/Application Requirements → DSP ISAs → DSP Architectures
• Data path configured for DSP algorithms
– Fixed-point arithmetic (most DSPs) (DSP ISA feature)
• Saturating arithmetic (saturation to handle overflow)
– MAC: Multiply-accumulate unit(s) (DSP architectural feature)
– Hardware rounding support
• Multiple memory banks and buses (DSP architectural feature)
– Harvard Architecture
– Multiple data memories/buses
– Usually with no data cache, for predictable fast data sample streaming
• Specialized addressing modes (DSP ISA feature)
– Bit-reversed addressing
– Circular buffers
– Dedicated address generation units are usually used (DSP architectural feature)
• Specialized instruction set and execution control (DSP ISA feature)
– Zero-overhead loops
– Support for fast MAC
– Fast Interrupt Handling (to meet real-time signal sampling/processing constraints)
• Specialized peripherals for DSP (DSP architectural feature)
– (System on Chip - SoC style)
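Bit-reversed addressing is typically used to reorder samples for radix-2 FFTs: the address generator steps through a buffer with the bits of the index reversed. The Python sketch below shows the index mapping only; the function name `bit_reverse` is hypothetical, and on a DSP this would be done by the address generation unit rather than in software.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit of index into result
        index >>= 1
    return result


# Access order for an 8-point (3-bit) buffer
print([bit_reverse(i, 3) for i in range(8)])
```

Applying the mapping twice returns the original index, so the same addressing mode serves both to scramble and to unscramble the data.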
DSP ISA Features
DSP Data Path: Arithmetic
Most Common: Fixed Point (16-bit) + Integer Arithmetic
• DSPs dealing with numbers representing real world signals
=> Want “reals”/fractions (fixed-point)
• DSPs dealing with numbers for addresses
=> Want integers
• Thus the DSP ISA (and DSP) must support “fixed point” as well as
integers
Fixed-point fraction (usually 16-bit): sign bit S followed by the radix point, -1 ≤ x < 1
Integer: sign bit S, radix point after the least significant bit, -2^(N-1) ≤ x < 2^(N-1)
In DSP ISAs: Fixed-point arithmetic must
be supported, floating point support is
optional and is much less common
Much Less Common: Single Precision Floating-point Support
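A 16-bit fixed-point fraction with the radix point right after the sign bit is commonly called Q15 (that name is an assumption here; the slide only says 16-bit fixed point). The sketch below shows encoding, decoding, and a fractional multiply; all function names are hypothetical.

```python
Q15_ONE = 1 << 15   # scale factor: 15 fractional bits


def to_q15(value):
    """Encode a real number in [-1, 1) as a 16-bit two's-complement fraction."""
    raw = int(round(value * Q15_ONE))
    return max(-Q15_ONE, min(Q15_ONE - 1, raw))   # clamp to the representable range


def from_q15(raw):
    """Decode a Q15 integer back to a real number."""
    return raw / Q15_ONE


def q15_mul(a, b):
    """16 x 16 -> 32-bit product, shifted back to Q15 by truncation."""
    return (a * b) >> 15


half = to_q15(0.5)
print(half, from_q15(q15_mul(half, half)))
```

The same 16-bit pattern is an integer to the address arithmetic and a fraction to the signal arithmetic; only the implied radix point differs.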
DSP ISA Features
DSP Data Path: Precision
16-bit Fixed-Point Most Common
• Word size affects precision of fixed point numbers
• DSPs have 16-bit, 20-bit, or 24-bit data words (16-bit most common)
• Floating point DSPs cost 2X - 4X vs. fixed point, and are slower
than fixed point
• DSP programmers will scale values inside code
– SW Libraries
– Separate explicit exponent
• “Blocked Floating Point”: single exponent for a group of
fractions
• Floating point support simplifies development for high-end
DSP applications.
In DSP ISAs: Fixed-point arithmetic must be supported, floating point
(single precision) support is optional and is much less common
DSP ISA Feature
DSP Data Path: Overflow Handling
• DSPs are descended from analog signal processors:
– On overflow the result does not wrap around (modulo arithmetic).
• It is set to the most positive (2^(N-1) - 1) or
most negative (-2^(N-1)) value: “saturation”
• Many DSP algorithms were developed in this
model.
Why support saturation? Due to the physical nature of signals.
[Figure: transfer curve that saturates at 2^(N-1) - 1 and at -2^(N-1).]
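The difference between saturation and wrap-around can be seen on a single 16-bit add. The two functions below are an illustrative sketch with hypothetical names; a DSP would perform the saturation in hardware within the same cycle as the add.

```python
def saturating_add(a, b, bits=16):
    """Clamp the sum to the most positive or most negative representable value."""
    most_positive = (1 << (bits - 1)) - 1
    most_negative = -(1 << (bits - 1))
    return max(most_negative, min(most_positive, a + b))


def wrapping_add(a, b, bits=16):
    """Two's-complement wrap-around (modulo 2^bits), as in ordinary integer hardware."""
    total = (a + b) & ((1 << bits) - 1)
    return total - (1 << bits) if total >= (1 << (bits - 1)) else total


print(saturating_add(30000, 10000), wrapping_add(30000, 10000))
```

With wrap-around, a large positive sum turns into a large negative number, which in a signal shows up as a full-scale glitch; saturation clips instead, much like an analog circuit reaching its supply rail.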
DSP Architectural Features
DSP Data Path: Specialized Hardware
• Fast specialized hardware functional units perform all
key arithmetic operations in 1 cycle (to help meet real-time constraints
for commonly needed operations), including:
– Shifters
– Saturation
– Guard bits
– Rounding modes
– Multiplication/addition (MAC)
• 50% of instructions can involve multiplier
=> single cycle latency multiplier
• Need to perform multiply-accumulate (MAC) fast
• n-bit multiplier => 2n-bit product
i.e. must optimize common operations
DSP Data Path: Multiply Accumulate (MAC) Unit
One or more MAC units
• Don’t want overflow or have to scale accumulator
• Option 1: accumulator wider than product:
“guard bits”
– Motorola DSP:
24b x 24b => 48b product, 56b Accumulator
• Option 2: shift right and round product before adder
[Figure: two MAC unit datapaths. Option 1: Multiplier → ALU add → Accumulator with guard bits (G). Option 2: Multiplier → Shift → ALU add → Accumulator.]
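The number of guard bits sets how many products can be accumulated before overflow becomes possible. The sketch below works under the worst-case assumption that every product uses the full product width; both function names are hypothetical.

```python
def max_accumulations(accumulator_bits, product_bits):
    """Worst-case number of full-width products that fit without overflow."""
    return 1 << (accumulator_bits - product_bits)


def guard_bits_needed(num_products):
    """Extra accumulator bits needed to sum num_products worst-case products."""
    return max(0, (num_products - 1).bit_length())


# Motorola example: 48-bit product, 56-bit accumulator (8 guard bits)
print(max_accumulations(56, 48), guard_bits_needed(256))
```

For the Motorola example this gives 8 guard bits and room for 256 worst-case accumulations, so a 256-tap FIR could run without rescaling the accumulator under this assumption.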
DSP Data Path: Rounding Modes
• Even with guard bits, will need to round when storing accumulator into memory
• 3 DSP standard options (supported in hardware, not in software as in GPPs):
1 • Truncation: chop results
=> biases results down (toward more negative values in two's complement)
2 • Round to nearest:
< 1/2 round down, >= 1/2 round up (more positive)
=> smaller bias
3 • Convergent:
< 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make lsb a zero (+1 if 1, +0 if 0)
=> no bias
IEEE 754 calls this round to nearest even
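The three rounding options can be compared on small integers. This is a minimal sketch, assuming two's complement fixed-point values from which the low `k` bits are dropped; the function names are illustrative, not taken from any DSP toolchain. In this model truncation always moves toward more negative values.

```python
def round_truncate(x, k):
    """Chop the low k bits (arithmetic shift right)."""
    return x >> k


def round_nearest(x, k):
    """Round to nearest; exact ties go up (more positive)."""
    return (x + (1 << (k - 1))) >> k


def round_convergent(x, k):
    """Round to nearest; exact ties go to the value whose lsb is zero."""
    half = 1 << (k - 1)
    q = x >> k                      # truncated result
    frac = x & ((1 << k) - 1)       # dropped bits, always non-negative
    if frac > half or (frac == half and (q & 1)):
        q += 1
    return q


def total_error(round_fn, k, values):
    """Sum of (rounded value scaled back) minus original, a crude bias measure."""
    return sum((round_fn(x, k) << k) - x for x in values)


values = range(16)
for fn in (round_truncate, round_nearest, round_convergent):
    print(fn.__name__, total_error(fn, 2, values))
```

Over a uniform set of inputs the accumulated error is largest for truncation, smaller for round to nearest, and cancels out for convergent rounding, matching the bias ranking on the slide.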
Data Path Comparison
DSP Processor
• Specialized hardware
performs all key arithmetic
operations in 1 cycle.
– e.g MAC
• Hardware support for
managing numeric fidelity:
– Shifters
– Guard bits
– Saturation
– Rounding modes
General-Purpose Processor
• Multiplies often take >1 cycle
• Shifts often take >1 cycle
• Other operations (e.g.,
saturation, rounding)
typically take multiple
cycles.
TI 320C54x DSP (1995) Functional Block Diagram
Multiple memory
banks and buses
MAC
Unit
Hardware support for rounding/saturation
First Commercial DSP (1982): Texas Instruments TMS32010
• 16-bit fixed-point arithmetic
• Introduced at 5 MHz (200 ns) instruction cycle.
• “Harvard architecture”
– separate instruction, data memories
• Accumulator
• Specialized instruction set
– Load and Accumulate
• Two-cycle (400 ns) Multiply-Accumulate (MAC) time.
[Figure: Instruction Memory and Data Memory, each connected to the Processor. Datapath (i.e. MAC unit): Mem -> T-Register -> Multiplier -> P-Register -> ALU -> Accumulator]
First Generation DSP mP
Texas Instruments TMS32010 - 1982
Features
• 200 ns instruction cycle (5 MIPS)
• 144 words (16 bit) on-chip data RAM
• 1.5K words (16 bit) on-chip program ROM - TMS32010
• External program memory expansion to a total of 4K words at full speed
• 16-bit instruction/data word
• Single cycle 32-bit ALU/accumulator
• Single cycle 16 x 16-bit multiply in 200 ns
• Two cycle MAC (5 MOPS)
• Zero to 15-bit barrel shifter
• Eight input and eight output channels
First Generation DSP mP TI TMS32010
Block Diagram
Program Memory
(ROM/EPROM)
Data/Samples
Memory
MAC
Unit
Barrel Shifter (1 cycle)
TMS32010 FIR Filter Code
• Here X4, H4, ... are direct (absolute) memory addresses:
    LT  X4   ; Load T with x(n-4)
    MPY H4   ; P = H4*X4
    LTD X3   ; Load T with x(n-3); x(n-4) = x(n-3); Acc = Acc + P
    MPY H3   ; P = H3*X3
    LTD X2   ; Load and Accumulate
    MPY H2
    ...
• Two instructions per tap, but requires unrolling
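The LT / MPY / LTD pattern above can be modeled in software to check what one filter step computes. This is a behavioral sketch only, not TMS32010 code: the function name `fir_step` and the list-based delay line are assumptions for illustration, and the Python loop stands in for what the DSP does with fully unrolled instruction pairs.

```python
def fir_step(x, h):
    """One FIR output using the load/multiply/shift pattern.

    x[k] holds x(n-k) and h[k] the matching coefficient.
    Returns (y, updated delay line).
    """
    x = list(x)
    n = len(h)
    acc = 0
    t = x[n - 1]               # LT: load T with the oldest sample
    p = h[n - 1] * t           # MPY: P = h * T
    for k in range(n - 2, -1, -1):
        t = x[k]               # LTD: load T with x(n-k) ...
        x[k + 1] = x[k]        # ... shift it one place down the delay line ...
        acc += p               # ... and add the previous product to the accumulator
        p = h[k] * t           # MPY: next product
    acc += p                   # final add of P to the accumulator
    return acc, x


y, delay = fir_step([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
print(y, delay)
```

Each tap costs one load-type instruction and one multiply, and the delay line is shifted as a side effect of the load, so no separate data-move instructions are needed.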
DSP Architectural Features
DSP Memory
• FIR tap implies multiple memory accesses
• DSPs require multiple data ports (separate memories for data, program)
• Some DSPs have ad hoc techniques to reduce memory bandwidth demand:
– Instruction repeat buffer: do 1 instruction 256 times
– Often disables interrupts, thereby increasing interrupt response time
• Some recent DSPs have instruction caches
– Even then may allow programmer to “lock in” instructions into cache
– Option to turn cache into fast program memory
• Usually DSPs have no data caches. Why? For better real-time performance predictability
• May have multiple data memories, e.g. one for signal data samples and one for filter coefficients
Conventional “Von Neumann’’ memory
AKA unified or Princeton memory architecture
HARVARD MEMORY ARCHITECTURE in DSP
(i.e. split memory)
e.g one for signal data samples and one for filter coefficients
ROM/EPROM/
FLASH?
Data Memory Banks (SRAM)
PROGRAM
MEMORY
X MEMORY
Y MEMORY
GLOBAL
P DATA
X DATA
Y DATA
Multiple memory
banks and buses
Memory Architecture Comparison
DSP Processor
• Harvard architecture (split)
• 2-4 memory accesses/cycle
• No caches: on-chip SRAM (for real-time performance predictability)
General-Purpose Processor
• Von Neumann architecture (i.e. unified memory, but not the L1 cache, which is split)
• Typically 1 access/cycle
• Use caches (makes real-time performance harder to predict)
[Figure: DSP: Processor connected to separate Program Memory and Data Memory. GPP: Processor connected to a single Memory.]
TI TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture
Instruction
Cache
Multiple memory
banks and buses
Data
Data
Program
Multiple memory
banks and buses
TI 320C62x/67x DSP (1997) – (Fourth Generation DSP)
Program
Data
DSP ISA Features
DSP Addressing Modes (Complex & Specialized)
• Have standard addressing modes: immediate, displacement, register indirect
• Want to keep MAC datapath busy.
• Assumption: any extra instructions imply additional clock cycles of overhead in inner loop and larger code size
=> Thus complex addressing is good
Why? To match data access patterns in DSP algorithms and reduce number of instructions (code size)
Examples:
• Autoincrement/Autodecrement register indirect
– lw r1,0(r2)+ => r1 <- M[r2]; r2 <- r2+1
– Option to do it before addressing, positive or negative
• “bit reverse” addressing mode.
• “modulo” or “circular” addressing
=> Don’t use normal datapath integer unit to calculate complex addressing modes:
– Instead use dedicated address generation units (a related DSP architectural feature).
DSP ISA Features
DSP Addressing: FFT
• FFTs start or end with data in butterfly (bit-reversed) order
Bit Reversed Addressing:
0 (000) => 0 (000)
1 (001) => 4 (100)
2 (010) => 2 (010)
3 (011) => 6 (110)
4 (100) => 1 (001)
5 (101) => 5 (101)
6 (110) => 3 (011)
7 (111) => 7 (111)
• How to avoid overhead of address checking instructions for FFT?
• Have an optional “bit reverse” addressing mode for use with autoincrement addressing
• Thus most DSPs have “bit reverse” addressing for radix-2 FFT
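The mapping in the table can be reproduced with a short bit-reversal routine. This is a software sketch of what the addressing hardware produces for free; the function name `bit_reverse` is an assumption for this example.

```python
def bit_reverse(index, n_bits):
    """Reverse the low n_bits of index (e.g. 001 -> 100 for n_bits = 3)."""
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (index & 1)   # move the lowest bit to the other end
        index >>= 1
    return result


# Access order for an 8-point radix-2 FFT.
order = [bit_reverse(i, 3) for i in range(8)]
print(order)
```

On a GPP this loop (or a lookup table) costs extra instructions per sample; a DSP with bit-reversed addressing generates the same sequence as a side effect of an autoincrement access.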
DSP ISA Features
Bit Reversed Addressing
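[Figure: Data flow in the radix-2 decimation-in-time FFT algorithm. Inputs enter in bit-reversed order x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7) (index bits 000, 100, 010, 110, 001, 101, 011, 111), pass through four 2-point DFTs, two 4-point DFTs, and one 8-point DFT, and produce outputs F(0) through F(7) in natural order.]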
DSP Addressing: Circular Buffers and Circular Buffer Addressing
• DSPs deal with continuous I/O (sampled signals)
• Often interact with an I/O buffer (delay lines)
• To save memory, buffers are often organized as circular buffers
• What can be done to avoid the overhead of address checking instructions for a circular buffer?
• Option 1: Keep start register and end register per address register for use with autoincrement addressing, reset to start when reach end of buffer
• Option 2: Keep a buffer length register, assuming buffer starts on aligned address, reset to start when reach end
• Every DSP has “modulo” or “circular” addressing
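The two options can be sketched as address-update functions. Both are illustrative models under stated assumptions: the function names are hypothetical, Option 1 is shown for a post-increment of 1, and Option 2 assumes the buffer length is a power of two and the buffer starts on an address aligned to that length.

```python
def next_addr_start_end(addr, start, end):
    """Option 1: post-increment by 1, wrapping to start after the end address."""
    return start if addr == end else addr + 1


def next_addr_length(addr, step, length):
    """Option 2: aligned power-of-two buffer; only the low address bits wrap."""
    base = addr & ~(length - 1)                    # aligned start of the buffer
    return base | ((addr + step) & (length - 1))   # offset wraps modulo length


addr = 100
walk = []
for _ in range(5):
    addr = next_addr_start_end(addr, start=100, end=103)
    walk.append(addr)
print(walk)
print(hex(next_addr_length(0x47, 1, 8)))
```

Option 2 needs no comparison at all, only masking, which is one reason power-of-two sizes and alignment make circular addressing fast and cheap in hardware.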
DSP ISA Features
Circular Buffers Addressing Support
Every DSP has “modulo” or “circular” addressing mode
Instructions accommodate three elements:
• Buffer address
• Buffer size
• Increment
Why? Allows for cycling through:
• delay elements (signal samples)
• Filter coefficients in data memory (or other DSP algorithm coefficients)
[Figure: circular buffer filled with samples, e.g. from A/D, and read out, e.g. to D/A]
DSP Architectural Features
Address calculation for DSPs
DSP
• Dedicated address generation units (do not use normal integer unit)
• Supports modulo and bit reversal arithmetic
• Often duplicated to calculate multiple addresses per cycle
Addressing Comparison
DSP Processor
• Dedicated address generation units (DSP architectural feature)
• Specialized addressing modes (DSP ISA feature); e.g.:
– Autoincrement
– Modulo (circular)
– Bit-reversed (for FFT)
• Good immediate data support
General-Purpose Processor
• Often, no separate address generation units
• General-purpose addressing modes (GPP ISA feature); number minimized in RISC ISAs
EECC722 - Shaaban
#65 lec # 8
Fall 2011 10-12-2011
DSP ISA Features
DSP Instructions and Execution
• May specify multiple operations in a single complex instruction (to reduce number of instructions and reduce code size):
– e.g. A compound instruction may perform: multiply + add + load + modify address register
• Must support Multiply-Accumulate (MAC)
• Need parallel move support
• Usually have special loop support to reduce branch overhead (reduce loop overhead)
– Loop an instruction or sequence
– 0 value in register usually means loop maximum number of times
– When the loop count is computed at run time, must guard against a computed 0, which the hardware treats as the maximum count rather than zero iterations
• May have saturating shift left arithmetic
• May have conditional execution to reduce branches (in 4th generation VLIW DSPs)
DSP ISA Features
DSP Low/Zero Overhead Loops
Examples
Example FIR inner loop on TI TMS320C54xx:
[Figure: code listing using a Repeat instruction with the number of filter taps]
In ADSP 2100: “DO <addr> UNTIL condition”
    DO X ...
    ...
 X
Address generation:
    PCS = PC + 1
    if (PC == X && !condition)
        PC = PCS
    else
        PC = PC + 1
• Lowers loop overhead
• Eliminates a few instructions in loops
• Important in loops with small bodies
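The program-counter rule for a zero-overhead loop can be simulated to see which addresses are fetched. This is a behavioral sketch only: `run_do_until` is a hypothetical helper, the loop condition is modeled as a simple iteration counter expiring, and no real ADSP 2100 encoding is implied.

```python
def run_do_until(program_len, do_addr, end_addr, count):
    """Return the sequence of fetched addresses for one hardware loop.

    The DO instruction sits at do_addr; the loop body runs from do_addr + 1
    through end_addr and is executed count times.
    """
    trace = []
    pc = 0
    pcs = None          # saved loop-start address (PCS)
    remaining = 0
    while pc < program_len:
        trace.append(pc)
        if pc == do_addr:
            pcs = pc + 1                 # PCS = PC + 1
            remaining = count
            pc += 1
        elif pcs is not None and pc == end_addr:
            remaining -= 1
            if remaining > 0:            # condition not yet true: loop back
                pc = pcs
            else:                        # condition true: fall through
                pcs = None
                pc += 1
        else:
            pc += 1
    return trace


print(run_do_until(program_len=5, do_addr=0, end_addr=2, count=3))
```

Only the DO instruction and the loop body appear in the trace; no compare, decrement, or branch instructions are fetched, which is what makes the overhead matter so little even for two-instruction loop bodies.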
Instruction Set (ISA) Comparison
DSP Processor ISA
• Specialized, complex instructions (e.g. MAC)
• Multiple operations per instruction
    mac x0,y0,a   x:(r0)+,x0   y:(r4)+,y0
Code Size = 16 bits (smaller code size)
• Zero or reduced overhead loops.
The above is in addition to the addressing mode differences identified earlier (slide 65)
General-Purpose Processor ISA
• General-purpose instructions (less complex)
• Typically only one operation per instruction
    mov *r0,x0
    mov *r1,y0
    mpy x0, y0, a
    add a, b
    mov y0, *r2
    inc r0
    inc r1
Code Size = 7 x 32 = 224 bits (14X, larger code size)
• No zero or reduced overhead loop support
DSP Architectural Features
Specialized Peripherals for DSPs
System on Chip (SoC) Approach
• Heavy integration of peripherals/components to reduce cost (chip count)/power
• Synchronous serial ports
• Parallel ports
• Timers
• On-chip A/D, D/A converters
• Host ports
• Bit I/O ports
• On-chip DMA controller
• Clock generators
• Co-processors, ASIC, micro-controller, ...
• Program/data memory and busses
• Component/system interconnects
[Figure: SoC containing DSP Core, Instruction Memory, Data Memory, A/D Converter, D/A Converter, and Serial Ports]
On-chip peripherals often designed for “background” operation, even when DSP core is powered down.
TI TMS320C203/LC203 Block Diagram
DSP Core Approach - 1995
Program
Data
Integrated
DSP Peripherals
Summary of Architectural Features of DSPs
• Data path configured for DSP (DSP architectural feature)
– Fixed-point arithmetic (most common: 95% of all DSPs)
– Fast MAC: multiply-accumulate
• Multiple memory banks and buses (DSP architectural features)
– Harvard architecture
– Multiple data memories
– Dedicated address generation units
• Specialized addressing modes (DSP ISA feature)
– Bit-reversed addressing
– Circular buffers
• Specialized instruction set and execution control (DSP ISA features)
– Zero-overhead loops
– Support for MAC
• Specialized peripherals for DSP (SoC)
• Avoiding dynamic processor architectural features that make real-time performance harder to predict (e.g. dynamic scheduling, hardware speculation, branch prediction, cache). Why? To achieve predictable real-time performance
THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN.
(or algorithm driven, DSP algorithms in this case)
DSP Software Development Considerations
• Different from general-purpose software development (requirements):
– Resource-hungry, complex algorithms.
– Specialized and/or complex processor architectures.
– Severe cost/storage limitations.
– Hard real-time constraints.
– Optimization is essential. Thus: program in DSP assembly? Most common (for performance) but changing
– Increased testing challenges.
• Essential tools:
– Assembler, linker.
– Instruction set simulator.
– HLL code generation: C compiler. (HLL/tools becoming more mature/gaining popularity)
– Debugging and profiling tools.
• Increasingly important:
– DSP software libraries (hand optimized).
– Real-time operating systems.
Classification of Current DSP Architectures
• Modern Conventional DSPs (Second Generation, late 1980’s -): lower cost/power
– Similar to the original DSPs of the early 1980s
– Single instruction/cycle. Example: TI TMS320C54x
– Complex instructions/not compiler friendly
– Usually one MAC unit
• Enhanced Conventional DSPs (Third Generation, early 1990’s -):
– Add parallel execution units: SIMD operation
– Complex, compound instructions.
– Example: TI TMS320C55x
– Not compiler friendly
– Usually more than one MAC unit
• Multiple-Issue DSPs (Fourth Generation, late 1990’s -): higher cost/power, higher performance
– VLIW. Example: TI TMS320C62xx, TMS320C64xx
• Simpler (RISC-like, fixed-width) instructions than conventional DSPs; more instructions and instruction bandwidth needed
• More compiler friendly
• Higher cost/power
• SIMD instruction support added to recent DSPs of this class
– Superscalar. Example: LSI Logic ZSP400, ZSP500
DSPs from all these three generations are still available today. Why?
A Conventional DSP: TI TMS320C54xx (Second Generation DSP, ~1989)
• 16-bit fixed-point DSP.
• Issues one 16-bit instruction/cycle
• Modified Harvard memory architecture
• Peripherals typical of conventional DSPs:
– 2-3 synch. serial ports, parallel port
– Bit I/O, Timer, DMA
• Inexpensive (100 MHz ~$5 qty 10K).
• Low power (60 mW @ 1.8V, 100 MHz).
• Has one MAC unit
A Current Conventional DSP: TI TMS320C54xx (Second Generation DSP)
[Figure: TMS320C54xx block diagram, with one MAC unit]
An Enhanced Conventional DSP: TI TMS320C55xx (Third Generation DSP, ~1994)
• The TMS320C55xx is based on Texas Instruments' earlier TMS320C54xx family (2nd generation DSP), but adds significant enhancements to the architecture and instruction set, including:
– Two instructions/cycle (limited VLIW?)
• Instructions are scheduled for parallel execution by the assembly programmer or compiler.
– Two MAC units.
• Complex, compound instructions:
– Assembly source code compatible with C54xx
– Mixed-width instructions: 8 to 48 bits.
– 200 MHz @ 1.5 V, ~130 mW, $17 qty 10k
• Poor compiler target.
An Enhanced Conventional DSP: TI TMS320C55xx (Third Generation DSP)
[Figure: TMS320C55xx block diagram, with 2 MAC units]
Multiple-Issue DSPs
16-bit Fixed-Point 8-way VLIW DSP: TI TMS320C6201 Revision 2 (1997)
Example Fourth Generation DSP
The TMS320C62xx is the first fixed-point DSP processor from Texas Instruments that is based on a VLIW-like architecture which allows it to execute up to eight 32-bit RISC-like instructions per clock cycle.
• More compiler friendly
• Higher cost/power
• SIMD instructions support added to recent DSPs of this class (TMS320C64xx)
• Floating point version: TMS320C67xx
[Figure: C6201 block diagram. Program Cache / Program Memory (32-bit address, 256-bit data, 512K bits RAM). C6201 CPU Megamodule: Program Fetch, Instruction Dispatch, Instruction Decode; Data Path 1 (A Register File; units L1, S1, M1, D1); Data Path 2 (B Register File; units D2, M2, S2, L2); Control Registers, Control Logic, Test, Emulation, Interrupts. Data Memory (32-bit address; 8-, 16-, 32-bit data; 512K bits RAM). Peripherals: Pwr Dwn, 4-DMA, Host Port Interface, Ext. Memory Interface, 2 Timers, 2 Multichannel buffered serial ports (T1/E1).]
TI TMS320C62xx Internal Memory
Architecture
• Separate Internal Program and Data Spaces
• Program
– 16K 32-bit instructions (2K Fetch Packets)
– 256-bit Fetch Width
– Configurable as either Direct Mapped Cache or Memory Mapped Program Memory
• Data
– 32K x 16
– Single Ported, Accessible by Both CPU Data Buses
– 4 x 8K 16-bit Banks
• 2 Possible Simultaneous Memory Accesses (4 Banks)
• 4-Way Interleave; Banks and Interleave Minimize Access Conflicts
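A small model shows why interleaving reduces conflicts between the two simultaneous accesses. The mapping below is an assumption for illustration: consecutive 16-bit halfwords are taken to rotate through the 4 banks, and only 16-bit accesses are modeled; the helper names are hypothetical.

```python
NUM_BANKS = 4
BANK_WIDTH_BYTES = 2   # each bank is 16 bits wide


def bank_of(byte_addr):
    """Bank holding the halfword at byte_addr, assuming halfword interleaving."""
    return (byte_addr // BANK_WIDTH_BYTES) % NUM_BANKS


def conflicts(addr_a, addr_b):
    """True if two same-cycle 16-bit accesses hit the same single-ported bank."""
    return bank_of(addr_a) == bank_of(addr_b)


print(conflicts(0, 2))   # neighboring halfwords sit in different banks
print(conflicts(0, 8))   # addresses one full interleave stride apart collide
```

Under this assumed mapping, two streams that walk memory sequentially from different starting offsets rarely land in the same bank, so both data buses can usually be served in the same cycle.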
TI TMS320C62xx Datapaths (Fourth Generation DSP, 8-way VLIW)
[Figure: C62xx datapaths. Registers A0 - A15 feed units L1, S1, M1, D1; Registers B0 - B15 feed units D2, M2, S2, L2. Cross paths 1X and 2X connect the two sides. Data buses: DDATA_I1 and DDATA_I2 (load data), DDATA_O1 and DDATA_O2 (store data), DADR1 and DADR2 (address). Legend: Cross Paths; 40-bit Write Paths (8 MSBs); 40-bit Read Paths/Store Paths.]
TI TMS320C62xx Functional Units
• L-Unit (L1, L2)
– 40-bit Integer ALU, Comparisons
– Bit Counting, Normalization
• S-Unit (S1, S2)
– 32-bit ALU, 40-bit Shifter
– Bitfield Operations, Branching
• M-Unit (M1, M2)
– 16 x 16 -> 32
• D-Unit (D1, D2)
– 32-bit Add/Subtract
– Address Calculations
(Statically Scheduled)
TI TMS320C62xx Instruction Packing
Instruction Packing: Advanced 8-way VLIW
• Fetch Packet
– CPU fetches 8 instructions/cycle
• Execute Packet
– CPU executes 1 to 8 instructions/cycle
– Fetch packets can contain multiple execute packets
• Parallelism determined at compile / assembly time
• Examples
– 1) 8 parallel instructions: A // B // C // D // E // F // G // H
– 2) 8 serial instructions: A, B, C, D, E, F, G, H (one per cycle)
– 3) Mixed Serial/Parallel Groups
• A // B
• C
• D
• E // F // G // H
• Reduces Codesize, Number of Program Fetches, Power Consumption
(Statically Scheduled VLIW)
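The three packing examples can be reproduced by splitting a fetch packet into execute packets. The sketch assumes the scheme described in TI's C6000 documentation, where each instruction carries a parallel bit (p-bit) saying whether the next instruction executes in the same cycle; the tuple representation and the name `split_execute_packets` are illustrative assumptions.

```python
def split_execute_packets(fetch_packet):
    """Group (name, p_bit) pairs into execute packets.

    p_bit == 1 means the next instruction runs in parallel with this one;
    p_bit == 0 closes the current execute packet.
    """
    packets, current = [], []
    for name, p_bit in fetch_packet:
        current.append(name)
        if p_bit == 0:
            packets.append(current)
            current = []
    if current:                      # tolerate a packet left open at the end
        packets.append(current)
    return packets


names = "ABCDEFGH"
parallel = list(zip(names, [1, 1, 1, 1, 1, 1, 1, 0]))   # example 1
serial = list(zip(names, [0] * 8))                      # example 2
mixed = list(zip(names, [1, 0, 0, 0, 1, 1, 1, 0]))      # example 3

print(split_execute_packets(mixed))
```

Because the grouping is encoded in the instructions themselves, a fetch packet never needs to be padded with NOPs to fill unused issue slots, which is where the code size and fetch savings come from.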
TI TMS320C62xx Pipeline Operation
Pipeline Phases
Fetch: PG PS PW PR; Decode: DP DC; Execute: E1 E2 E3 E4 E5
• Single-Cycle Throughput
• Operate in Lock Step
• Fetch
– PG Program Address Generate
– PS Program Address Send
– PW Program Access Ready Wait
– PR Program Fetch Packet Receive
• Decode
– DP Instruction Dispatch
– DC Instruction Decode
• Execute
– E1 - E5 Execute 1 through Execute 5
Pipeline diagram (one column per clock cycle):
Execute Packet 1  PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 2     PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 3        PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 4           PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 5              PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 6                 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 7                    PG PS PW PR DP DC E1 E2 E3 E4 E5
C62x Pipeline Operation
Delay Slots
• Delay Slots: number of extra cycles until result is:
– written to register file
– available for use by subsequent instructions
– Multi-cycle NOP instruction can fill delay slots while minimizing code size impact
Most Instructions: E1 (No Delay)
Integer Multiply: E1 E2 (1 Delay Slot)
Loads: E1 E2 E3 E4 E5 (4 Delay Slots)
Branches: E1, then Branch Target PG PS PW PR DP DC E1 (5 Delay Slots)
(Statically Scheduled VLIW)
For better real-time performance predictability
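Since the hardware does not interlock, the programmer or compiler has to count delay slots. The sketch below uses the delay-slot counts from the table above; the dictionary keys and the helper `nops_needed` are assumptions for illustration, and `distance` is the number of cycles between issuing the producer and issuing the consumer.

```python
# Delay slots per instruction class, as listed on the slide.
DELAY_SLOTS = {"alu": 0, "multiply": 1, "load": 4, "branch": 5}


def nops_needed(producer, distance):
    """NOP cycles to insert so the consumer sees the producer's result.

    distance = 1 means the consumer would issue in the very next cycle.
    """
    earliest = DELAY_SLOTS[producer] + 1    # first cycle the result is usable
    return max(0, earliest - distance)


print(nops_needed("load", 1))       # use right after a load
print(nops_needed("multiply", 1))   # use right after a multiply
print(nops_needed("load", 4))       # three useful instructions placed in between
```

A good schedule fills those cycles with independent work instead of NOPs; either way the timing is fixed at compile or assembly time, which keeps execution time predictable.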
C6000 Instruction Set Features
Conditional Instruction Execution
• All Instructions can be Conditional (similar to Intel IA-64)
– A1, A2, B0, B1, B2 can be used as Conditions
– Based on Zero or Non-Zero Value
– Compare Instructions can allow other Conditions (<, >, etc)
• Reduces Branching
• Increases Parallelism
C6000 Instruction Set Addressing
Features
• Load-Store Architecture
• Two Addressing Units (D1, D2)
• Orthogonal
– Any Register can be used for Addressing or Indexing
• Signed/Unsigned Byte, Half-Word, Word, DoubleWord Addressable
– Indexes are Scaled by Type
• Register or 5-Bit Unsigned Constant Index
C6000 Instruction Set
Addressing Modes/Features
• Indirect Addressing Modes
– Pre-Increment *++R[index]
– Post-Increment *R++[index]
– Pre-Decrement *--R[index]
– Post-Decrement *R--[index]
– Positive Offset *+R[index]
– Negative Offset *-R[index]
• 15-bit Positive/Negative Constant Offset from Either B14 or
B15
• Circular Addressing
– Fast and Low Cost: Power of 2 Sizes and Alignment
– Up to 8 Different Pointers/Buffers, Up to 2 Different Buffer
Sizes
• Bit-reversal Addressing
• Dual Endian Support
FIR Filter On TMS320C54xx vs. TMS320C62xx
[Figure: FIR filter code listings for the 2nd Gen Conventional DSP (TMS320C54xx) and the 4th Gen VLIW DSP (TMS320C62xx). VLIW DSP: larger code size; two filter taps in parallel.]
TI TMS320C64xx
• Announced in February 2000, the TMS320C64xx is an extension of Texas Instruments' earlier TMS320C62xx architecture.
• The TMS320C64xx has 64 32-bit general-purpose registers, twice as many as the TMS320C62xx.
• The TMS320C64xx instruction set is a superset of that used in the TMS320C62xx, and, among other enhancements, adds significant SIMD/media processing capabilities (not in C62xx):
– 8-bit operations for image/video processing.
• Introduced at 600 MHz clock speed (1 GHz now), but:
– 11-stage pipeline with long latencies
– Dynamic caches.
• $100 qty 10k.
• The only current DSP family with compatible fixed and floating-point versions.
[Figure: memory use comparison across DSPs]
C64xx (also C62xx and C67xx) VLIW DSPs have higher memory use due to simpler (RISC-like, fixed-width) instructions than conventional DSPs; more instructions and instruction bandwidth are needed.
Another processor in the comparison is also VLIW but uses variable-length instruction encoding (16-32 bits), giving less memory use than C64xx.
Computational
(XScale)
Multiple-Issue 4th Generation DSPs Example
Superscalar DSP: LSI Logic ZSP400
• A 4-way superscalar, dynamically scheduled (good or bad for a DSP?) 16-bit fixed-point DSP core.
• 16-bit RISC-like instructions
• Separate on-chip caches for instructions and data
• Two MAC units, two ALU/shifter units
– Limited SIMD support.
– MACs can be combined for 32-bit operations.
• Possible Disadvantage:
– Dynamic behavior complicates DSP software development:
• Ensuring real-time behavior
• Optimizing code.
2004
2010
2004
GPP
(4th generation TI DSP)
TI not actively improving their flagship
FP DSP (fixed-point more important!)
