Transcript C66x

How to realize high-performance compute with Multicore DSP
C667x Target Applications (Non-Telecom)
Mission Critical
Video Infrastructure
Test and Automation
Infrastructure Audio
HPC, Imaging and Medical
Emerging Others
Emerging
Broadband
Innovations
TI Confidential – NDA Restrictions
RF and Communication Applications
Military & Defense / Avionics
• ISR (Intelligence/Surveillance/Reconnaissance)
o SIGINT/COMINT/Signal Generators
• Military Communications
o SDR (JTRS): Manpack/LMR/Fixed
o Comm. Infra: VoIP/Video Gateways
• Satellite/Avionics Communications
o Ground Receiver/Repeaters
o Weather Radar
Govt & Public Safety
• FAA: Civil Aviation/Govt Comm.
• Conventional PS: TETRA/APCO/E911
o Wireless Infrastructure
o Comm. Infra: VoIP/Video Gateways
• Emerging Broadband (OFDM/LTE/WiMAX)
o Utilities/Transport/Smart Grid
Key Customer Careabouts
•Long Term Partnership
•Financial Stability
•Strong Roadmap and R&D
•Floating Point Performance
•Size, Weight, and Power (SWaP)
•I/O Bandwidth
•Longevity of supply (10+yrs)
RF and Comm. Product Requirements
End Product Need
• Support Multiple Waveforms
• Common Platform for TDMA/CDMA/OFDMA
• Multi-channel VoIP/Video capability
• Support FEC and Modulation
• TCP/IP Networking support
DSP Requirement
• Needs Raw Performance in terms of MIPS/GHz/MMACS
• Floating Point Capable ISA to achieve "precision" and high GFLOPS
• Large On Chip RAM
– Reduce accesses to slow external memory
• High Speed External Memory Interface
• Large addressable memory
• Efficient DMA architecture
• Wireless specific accelerators and TCP/IP Offload
Imaging Product Requirements
End Product Need
• High BW Interface
• RF Front End and Telecom ports
• Connect Multiple DSPs on a board, e.g. in an ATCA Card
• High BW Backplane and Network Connectivity
• Reliability in Mission Critical Designs
• Low Power Design
• Ease of Use
DSP Requirement
• Needs multiple high speed interfaces
– PCIe, Serial RapidIO
– OBSAI/CPRI Interface
– Gigabit Ethernet, etc.
• Memory Error Correction & Checking (ECC)
• Efficient Low Power DSPs
• Support for extended temperature ranges from -40°C to 105°C, among others
• Dev and Debug Tools
• Multicore S/W Frameworks
• Signal/Image Processing functions
• VoIP Library
• Audio/Video Codecs
Introducing “Keystone Architecture” (C66x)
The Best Combination of Performance (GHz) and Power Consumption in the Industry
16GFLOPs & 32GMACS per Core @ 1GHz
Next-Generation C66x DSP Core (NEW)
• Fixed and Floating-point Core @ 1.25 GHz
• 4x C64x+ MAC (32)
• 4x C67x Fl pt MAC (8)
• 16 FLOP/cy compared to 6 FLOP/cy
• 100% Code Compatible with all C64x (fixed) & C67x (floating) Devices
• Similar Power Profiles as C64x Core
• Supported by Code Composer Studio IDE
C64x+ Core (Fixed pt)
• Lowest Power Highest Performance DSP Core
C67x Core (Floating pt)
• Industry's Lowest Power FP DSP Core
• High precision and wide dynamic range
KeyStone Architecture: Multicore DSP
8-core C6678, based on the C66x core, delivers 320 GMACS / 160 GFLOPS @ 1.25 GHz/core (effectively a 10 GHz DSP)
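As a quick sanity check on the headline numbers, the sketch below multiplies out the per-core figures quoted on this slide (32 MACs and 16 FLOPs per cycle, as implied by 32 GMACS and 16 GFLOPS per core at 1 GHz). The function name and parameters are illustrative only, and the results are theoretical peaks rather than measured throughput.

```python
def peak_rates(cores, clock_ghz, macs_per_cycle=32, flops_per_cycle=16):
    """Return (GMACS, GFLOPS, aggregate GHz) as theoretical peak values.

    The per-cycle defaults are the C66x figures quoted on the slide;
    everything else about this helper is an assumption for illustration.
    """
    aggregate_ghz = cores * clock_ghz
    return (aggregate_ghz * macs_per_cycle,
            aggregate_ghz * flops_per_cycle,
            aggregate_ghz)


gmacs, gflops, ghz = peak_rates(cores=8, clock_ghz=1.25)
print(f"C6678: {gmacs:.0f} GMACS, {gflops:.0f} GFLOPS, {ghz:.0f} GHz aggregate")
```

With 8 cores at 1.25 GHz this reproduces the 320 GMACS, 160 GFLOPS and "10 GHz" figures; the last one is simply the sum of the core clocks.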
Unmatched Performance
[Figure: BDTImark2000(TM) scores in two bar charts, "BDTI Score for Floating Point Processors" and "BDTI Score for Fixed Point Processors". Processors compared: ADI 2116x, 2126x and 213xx (SHARC); ADI TS201S and TS202S/203S (TigerSHARC); ADI BF5xx (Blackfin); NEC uPD77050; Freescale MSC81xx (SC140), MSC814x (SC3400) and MSC815x (SC3850); Intel Pentium III; Renesas SH77xx (SH-4); TMS320C67x, TMS320C64x+ and TMS320C66xx. Bar values are not recoverable.]

Benchmark timings (baseline column is C67x @ 300 MHz or C64x+ @ 1.2 GHz):
Algorithm                                               Baseline    C66x @ 1.25 GHz   Gain
Single Precision Floating Point FFT, 2048 pt, Radix 4   86.84 us    14.00 us*         ~600%
Fixed Point FFT, 2048 pt, Radix 4                       8.23 us     4.46 us*          ~200%
FIR Filter, 40 samples, 40 taps                         0.69 us     0.34 us*          ~200%
Matrix Multiply 32 x 32                                 17.92 us    6.16 us*          ~300%
Matrix Inverse 4 x 4                                    0.53 us     0.13 us*          ~400%
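The gain percentages quoted for these benchmarks appear to be speedup ratios (baseline time divided by C66x time, times 100) rather than percentage improvements. The sketch below recomputes them from the timings on the slide; the dictionary layout and function name are assumptions made for illustration.

```python
# (baseline time in us, C66x @ 1.25 GHz time in us), as quoted on the slide.
timings_us = {
    "SP floating-point FFT, 2048 pt, radix 4": (86.84, 14.00),
    "Fixed-point FFT, 2048 pt, radix 4": (8.23, 4.46),
    "FIR filter, 40 samples, 40 taps": (0.69, 0.34),
    "Matrix multiply 32 x 32": (17.92, 6.16),
    "Matrix inverse 4 x 4": (0.53, 0.13),
}


def speedup_percent(baseline_us, c66x_us):
    """Speedup expressed as a percentage: 200% means twice as fast."""
    return 100.0 * baseline_us / c66x_us


for name, (baseline, c66x) in timings_us.items():
    print(f"{name}: {speedup_percent(baseline, c66x):.0f}%")
```

The computed values land within a few tens of percentage points of the rounded figures on the slide, which is consistent with reading "~600%" as "about six times faster".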
TI Multicore KeyStone Architecture
• Highest Integration
– Cost & Power
• Common Architecture
– Portable Software
• Scalable
– Tailored Solutions
• Navigator
– Innovative Multi-core System Management
• Floating Point
– Development Time
• Tools & Debugging
– R&D Efficiency
• Quality Software
– Solutions & Libraries
[Diagram labels: Multicore Navigator; Network on Chip; C66x, ARM Processing Cores; Multicore Shared Memory Controller; Shared Memory; TeraNet 2; (Debug, Clocking, Power); Application Accelerators; High Speed I/O; HyperLink 50]
The first network on chip infrastructure to unleash full multicore entitlement
Product Highlights: C6670 and C6678
C6670 (Performance Optimized Core)
Next Generation C66x Core
- 4 C66x Cores @ 1GHz - 1.2GHz
Memory Architecture
- 4MB Local L2 in total (1MB per Core)
- 2MB Multicore Shared Memory
Communication Accelerators
- TCP3e (Turbo Encode) – Up to 550Mbps
- TCP3d (Turbo Decode) – Up to 600Mbps
- FFTC – 2048 FFT every 4.6µs
- VCP2 for voice channel decoding
C6678 (Power Optimized Core)
Next Generation C66x Core
- Up to 8 C66x Cores @ 1GHz - 1.25GHz
- Available Options: 1, 2, 4, and 8 Core Devices
Memory Architecture
- 4MB Local L2 in total (512KB per Core)
- 4MB Multicore Shared Memory
Power Optimized Core
- <10W at 1GHz, nominal temp
[Block diagram, C6670: C66x DSP cores with L1/L2; Multicore Navigator; TeraNet; Communications CoProcessors (4x VCP2, 3x TCP3d, 2x RAC, 1x TAC, 3x FFTC, BCP); Network CoProcessors (Crypto, Packet Accelerator); Multicore Shared Memory Controller (MSMC) with 2MB Shared Memory; DDR3 64b; System Elements (Power Management, SysMon, Debug, EDMA); HyperLink; Peripherals & IO (SRIO x4, PCIe x2, AIF2 x6, SGMII x2, I2C, SPI, UART)]
[Block diagram, C6678: 8 x CorePac (C66x DSP with L1/L2); Multicore Navigator; TeraNet; Memory Subsystem (MSMC with 4MB Shared Memory, DDR3 64b); Network CoProcessors (Crypto, Packet Accelerator, IP Interfaces, GbE Switch, 2x SGMII); System Elements (Power Management, Debug, SysMon, EDMA); HyperLink; Peripherals & IO (SRIO x4, PCIe x2, EMIF 16, TSIP x2, I2C, SPI, UART)]
Innovation & Integration via C6678 DSP Highlights
Multicore Navigator
Data transfer engine that is architected to move data between various system elements without using any CPU overhead, so maximum system efficiency is achieved
C66x Core
Next generation Fixed / Floating-Point DSP core with clock speeds ranging from 1GHz to 1.25GHz and up to 8 core options
S/W Dev and Debug Support
Improved Debug, leveraged by CCS
HyperLink
Ultra high-speed (up to 50 Gbaud), low latency serial interface that connects to other DSPs and FPGAs in the system
Network Co-Processor and Accelerators
A cost effective implementation to off-load the TCP/IP and secure networking functions from the DSP
Memory Architecture
• 0.5 MB of local Memory per core
• 4 MB of Shared Memory
• Enhanced memory architecture through an enhanced Multicore Shared Memory Controller
• Bottleneck-free fast on- and off-chip memory access including a DDR3-1333MHz (64-bit) interface
• L1/L2/L3 ECC
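To put the external memory interface in perspective, the sketch below estimates its theoretical peak bandwidth. It assumes that "DDR3-1333MHz" denotes a 1333 MT/s data rate (the usual meaning of DDR3-1333) and ignores refresh, bank conflicts and bus turnaround; the function name is made up for this illustration.

```python
def ddr_peak_gbytes_per_s(data_rate_mt_s, bus_width_bits):
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8
    return data_rate_mt_s * bytes_per_transfer / 1000


print(f"DDR3-1333, 64-bit: {ddr_peak_gbytes_per_s(1333, 64):.2f} GB/s peak")
print(f"DDR3-1600, 64-bit: {ddr_peak_gbytes_per_s(1600, 64):.2f} GB/s peak")
```

Sustained bandwidth in a real system is lower, which is one reason the deck emphasizes large on-chip L2 and shared memory.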
TeraNet
Switch fabric that has 2 Terabits of bandwidth, which allows maximum data transfer between system components to realize full system entitlement
Peripherals and I/O Interfaces
High bandwidth peripherals that operate independently (NOT shared), allowing simultaneous data transfer to prevent bottlenecks, featuring:
• RapidIO v2.1 – 4 lanes @ 5Gbps with 1x, 2x and 4x support
• PCIe x2 – 2 lanes, running independently of RapidIO
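The lane rates quoted for the serial interfaces are line rates, so usable payload bandwidth is lower once line coding is accounted for. The sketch below assumes 8b/10b coding for both Serial RapidIO v2.1 and PCIe Gen 2 at 5 Gbaud per lane, and ignores packet and protocol overhead; the helper name is hypothetical.

```python
def payload_gbps(lanes, line_rate_gbaud, data_bits=8, coded_bits=10):
    """Aggregate payload rate in Gbit/s after line coding, before protocol overhead."""
    return lanes * line_rate_gbaud * data_bits / coded_bits


srio = payload_gbps(lanes=4, line_rate_gbaud=5.0)  # RapidIO v2.1, 4x mode
pcie = payload_gbps(lanes=2, line_rate_gbaud=5.0)  # PCIe x2, assuming Gen 2 rate
print(f"SRIO x4: {srio:.0f} Gbit/s, PCIe x2: {pcie:.0f} Gbit/s per direction")
```

Because the two interfaces run independently, these figures can in principle be sustained at the same time rather than shared.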
Competitive Analysis
Value Prop against FPGA
•C66x Performance
– 320 GMACS / 160 GFLOPS
– Baseband on a chip. Handles
multiple waveforms supporting
OFDM,CDMA,TDM
– L1/L2/L3 Processing capability
– Wireless Accelerators
(VCP/TCP/FFT)
•Software Programmability
– Time To Market
•Smaller Package
(more DSP/Board)
•Lower Power
– smaller battery, simpler cooling
Value Prop against other DSPs
•C66x Fixed & Floating Point Core @ 1.25GHz
– Industry’s Fastest DSP at 10GHz
•On-Chip RAM up to 8MB
•DDR3
– 1600MHz, 64Bit, 8GB Address space
•Multiple Independent High Speed IO
– 4xsRIOv2.1,2xPCIe Gen II, 2xSGMII, 2xTSIP
•High BW FPGA connectivity
– Hyperlink @ 50Gbps
•1/2/4/8 Core Option (Pin Compatible)
•L1/L2/L3 Memory ECC – System Reliability
•Low Power per GFLOPs and GMACS
•Extended Temp support -40°C to 105°C
•CCS Tools + S/W Collateral
•3rd Party Network
•Low Cost - MIPs/$
TMDXEVM6678L EVM
Single-wide AMC form factor
C6678
Code Composer Studio™ IDE
H/W Development Tools
*Design *Code and Build *Debug *Analyze *Tune
CCSv5 Allows designers of all experience levels to move quickly
through application development (www.ti.com/ccstudio)
•Time Limited FREE Evaluation Versions available for download.
Includes C667x Simulator
EVM Kit includes
•BIOS 6.x
•BIOS-MCSDK / LINUX-MCSDK 2.0 (NDK, PDK, LIB etc.)
•Sample Program and Out of box demo (OOB), e.g. I/O Benchmark, Imaging Processing Pipeline and High Performance DSP Utility Application (HUA)
•User Guide, Starter guide, Tech Ref Guide, App Notes etc.
•TMDXEVM6678L – EVM with XDS100 emulation - $399
•TMDXEVM6678LE – EVM with XDS560V2 emulation - $599
•TMDXEVM6678LXE – EVM with XDS560V2 emulation, Encryption Enabled - $599
•TMDSEMU560v2STM-UE - XDS560v2 System Trace Emulator with 128Mb System Trace buffer and Ethernet / USB support
•Optional PCIe adapter card to connect the C6678 EVM to a standard PCI header of a desktop.
TI’s Multicore Hardware Ecosystem
Others
Chassis / System
Standardized Boards
PCIExpress (with Gen 2)
Advanced Mezzanine (AMC)
Custom
ATCA
Other
TI’s Multicore Software Ecosystem
Customer Application
Multicore Entitlement
Layer 2+
Layer 1 UMTS
IP Network
Stack
Layer 1 LTE
TI Runtime
TI’s Device Entitlement Libraries
TI Layer 1 Libraries
TI BIOS, Linux, OSE(ck)
Multicore Tools and Software (MC-SDK)
• Tools
– Codegen with OpenMP support
– Emulator/Debugger
– Simulator
– Profiler / DVT
– 3rd party tools
• Software
– BIOS/Linux SDK
• Multicore Demonstration
• 6.x DSP BIOS
– Platform Abstraction
– Basic Networking
– Inter core communication
[Diagram labels: Eclipse; Code Composer Studio(TM) with Editor/IDE, Compiler/Linker (Codegen), Debugger, Profiler, Remote Debug, SoC Analyzer; Third Party Plug-Ins (Polycore, ENEA Optima, 3L); DSP; Customer Application; Multicore Software Development Kit with Demo Apps (Multicore BIOS, Multicore BIOS and Linux, Multicore Linux), DSPLIB, IMGLIB, NDK, Speech Codec, Audio Codec, Video Codec; Operating System w/ Boot Loader (BIOS, Linux); Multicore Entitlement; Inter Core Communication; Full Silicon Entitlement; Platform Development Kit]
• Application Specific Libraries
– Audio/Video CODECS
– VoIP Components
– WiMAX Toolkit, LTE Toolkit
– DSPLib
• others..
[Diagram labels: Target Board; Host Computer; XDS 560 V2; XDS 560 Trace]
KeyStone Multicore Software – Libraries & Codecs
Libraries
Digital Signal Processing
• FFT
• Adaptive Filtering
• Filtering and convolution
• Others
• Available free from TI
Image Processing
• Edge Detection
• Boundary
• Morphology
• Others
• Available free from TI
Voice and Fax
• Line Echo Cancellation
• Voice Activity Detection
• Others
• Available free from TI
Security/Cryptography
• AES, SHA1, 3DES
Vision Lib (object only)
• 50+ royalty-free kernels
MATLAB
• Image processing
• Math operations
Vision Analytics
• Background modeling & subtraction
• Object feature extraction
• Tracking, recognition
• Low-level pixel processing
Codecs
Voice
• G.711, G.722
• G.723, G.729
• CDMA, AMR (NB/WB), EVRC-B
• Others
Fax
• T.38
• Fax Modem
Video
• H.263
• H.264
• MPEG2
• MPEG4
• VC1/WMV9 Decode
• Others
Audio
• MPEG1 Layer2
• AAC LC/HE
• AC3 2.0/5.1
• Sample Rate Conversion
High-Performance and Multicore Processor
High Value
Keystone Architecture
High-Performance
at the Right
Power & Price
Low-Cost EVM
Open & Affordable Tools
Easy to Use
Training
Product Collateral
Drivers &
Example Code
Quick to Market
User Community
Enabler Software
Quick-Start Hardware
Benchmarks & Functional
Understanding
Frameworks &
Abstraction
Generic
Libraries
Application
Libraries
Getting Started – More Information/Links
• Product Folders:
– C66X Informational Wiki Page
– All C6000 Multicore DSPs
• TMS320C6670
• TMS320C6678
• EVMs and Software Tools:
– TMS320C6678 EVM
– TMS320C6670 EVM
– AMC to PCIe Adapter Card
– Multicore Software Development Kit for BIOS & Linux
• MCSDK Wiki
• CCS v5 Wiki
• C66x Linux Wiki
– DSP Signal Processing Library (DSPLIB)
– Image and Video Processing Library (IMGLIB)
– LTE/WiMAX Toolkit – Discuss with BDM
• Technical Support
– TI E2E Community (Online Support)
– Product Training
Online Video Training
http://focus.ti.com/docs/training/catalog/events/event.jhtml?sku=OLT110027
Mission Critical DSP Market
• Undisputed #1 DSP and SoC supplier
– Strong Growth for 8 years in a row, even in 2009
– Higher R&D spending than DSP revenue of most competitors
• KeyStone SoC Architecture secures future success
– Rich Product Portfolio & Strong Roadmap
– 2 Families with multiple devices and growing
• Nyquist (6670), Shannon (6678/4/2)
• 40nm -> 28nm
• Tools/Software & Compilers
• 3rd Party Eco-System
– Multiple Design Wins Pre-Announcement
[Chart labels: Revenue, 2002 to 2009; TI SoC Architecture; Layer 1; Macro; Pico; Femto; PHY; Software Radio; IP Network]
"What Customers Like about TI"
• Secure Supply – No DSP product discontinuation (end of life)
• History of delivery upon promises (Power, GHz, ..)
• Field Experience - Completeness of system analysis, Architecture, Internal Switch, ….
• Customer Support
• Business Model - Long Term relationships with key customers
– Actively seek and incorporate customer feedback in roadmap devices.
Backup Slides
Product Details
C6678 (Shannon) "Lightning" Half-Length PCIe Card Feature Set
• TI TMS320C6678 (8-core) x 4
– C66x Core Frequency: 1.25GHz
– DDR3 Memory
– Data Frequency: 1600MHz
– Data Bus Width: 64-bit
– Serial RapidIO Gen-2 Interface
– PCIe Gen-2 Interface
– 10/100/1000Mbps Ethernet w/ SGMII
– Hyperlink50 Interface
• 1024 MB DDR3-1333 on board
• PLX PEX8624 PCIe Gen-2 Switch
• Serial RapidIO daisy-chain
• Ethernet daisy-chain
• Each DSP device is linked to PCIe switch by x2 lanes
• Dual DSPs linked by Hyperlink50
• Power: Max 54 Watts
What is Hyperlink?
"high-speed, low-latency, and low-pin-count communication interface"
•Low pin count (24 pins)
•Point to Point Connection
•Interconnect
•DSP-to-DSP
•DSP-to-FPGA
•SerDes for data transfer
• x1 and x4 modes for Tx and Rx
•12.5GBaud/lane
•Effectively 8b9b encoding
•LVCMOS sideband signals for flow control & power mgmt
- errors/events/timeouts
* Simple packet-based transfer protocol for memory-mapped access
* Read/Write to DSP/FPGA local memory
- Up to 64 memory-mapped regions, each region up to 256MB
- discrete memory access of any byte-aligned width up to 64 bits
- burst transfer modes
• Write (Maximum Burst Size 256 Bytes)
– Write Request --->
– Data Packet --->
• Read (Maximum Burst Size 256 Bytes)
– Read Request --->
– Read Response <---
• Interrupt Request <-->
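The HyperLink figures quoted here (x4 mode, 12.5 GBaud per lane, effectively 8b9b encoding, 64 regions of up to 256 MB) can be combined into a rough capacity estimate. The sketch below is arithmetic on those slide values only; the constant and function names are invented for illustration, and packet header overhead is not modeled.

```python
LANES = 4                 # x4 mode
GBAUD_PER_LANE = 12.5     # line rate per lane, from the slide
DATA_BITS, CODED_BITS = 8, 9  # "effectively 8b9b" encoding


def hyperlink_rates(lanes=LANES, gbaud=GBAUD_PER_LANE):
    """Return (raw Gbaud, payload Gbit/s) per direction for the given lane count."""
    raw = lanes * gbaud
    payload = raw * DATA_BITS / CODED_BITS
    return raw, payload


def mapped_window_mb(regions=64, region_mb=256):
    """Largest remote address window reachable through the memory-mapped regions."""
    return regions * region_mb


raw, payload = hyperlink_rates()
print(f"raw {raw:.1f} Gbaud, payload about {payload:.1f} Gbit/s "
      f"({payload / 8:.2f} GB/s) per direction")
print(f"mapped window up to {mapped_window_mb() // 1024} GB")
```

The raw figure matches the "50" in the HyperLink 50 name, while the payload figure is what remains after line coding.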
Universal Parallel Port (uPP)
• What is it?
– Parallel bus, two independent channels (separate data buses)
– I/O speeds up to 75 MHz with 8-16 bit data width per channel
– 1 or 2 channel parallel interface operating in RX, TX or FD mode
– Supports double data rate mode of operation (bandwidth does not change/increase)
• Application
– Each channel can interface cleanly with high-speed ADCs and/or DACs with up to 16-bit data width (per channel).
– Useful as low cost interface with FPGAs. Can run up to 120 MByte/s per channel in single channel or bi-directional mode (240 MByte/s for both channels in unidirectional mode)
– Can also be used to interface two C6655/57 devices or to connect C6655/57 with C674x or OMAP-L13x family of devices.
• Other benefits
– Internal DMA – leaves CPU EDMA free
– Simple protocol with few control pins (configurable: 2-4 per channel)
– Multiple data packing formats for 9-15 bit data widths
– Interleave mode (single channel only)
– Simple interface: I/O queued by software
• Throughput Estimates:
Note: Max. clock of 50 MHz in (*) configuration
Thank You