keystone-workshop.googlecode.com

KeyStone C66x Multicore SoC Overview
Multicore Applications Team
Performance Improvement
• C66x ISA (enhanced DSP core)
– 100% upward object code compatible
– 4x performance improvement for multiply operation
– 32 16-bit MACs
– Improved support for complex arithmetic and matrix computation
• C674x
– 100% upward object code compatible with C64x, C64x+, C67x and C67x+
– Best of fixed-point and floating-point architecture for better system performance and faster time-to-market
FLOATING-POINT VALUE
• C67x+
– 2x registers
– Enhanced floating-point add capabilities
• C67x
– Native instructions for IEEE 754, SP&DP
– Advanced VLIW architecture
FIXED-POINT VALUE
• C64x+
– SPLOOP and 16-bit instructions for smaller code size
– Flexible level one memory architecture
– iDMA for rapid data transfers between local memories
• C64x
– Advanced fixed-point instructions
– Four 16-bit or eight 8-bit MACs
– Two-level cache
KeyStone Device Architecture
[Block diagram: 1 to 8 C66x CorePac cores @ up to 1.25 GHz, each with L1P, L1D, and L2 cache/RAM; Memory Subsystem (MSM SRAM, MSMC, 64-bit DDR3 EMIF); Multicore Navigator (Queue Manager, Packet DMA); Network Coprocessor (Packet Accelerator, Security Accelerator, Ethernet switch, 2x SGMII); TeraNet; HyperLink; application-specific coprocessors; external interfaces (4x SRIO, 2x PCIe, SPI, UART, I2C, GPIO, application-specific I/O); Debug & Trace, Boot ROM, Semaphore, Power Management, PLL, EDMA.]
CorePac
Memory Subsystem
Multicore Navigator
Network Coprocessor
External Interfaces
TeraNet Switch Fabric
Diagnostic Enhancements
HyperLink Bus
Miscellaneous
Application-Specific
C66x CorePac
• 1 to 8 C66x CorePac DSP Cores operating
at up to 1.25 GHz
– Fixed- and floating-point operations
– Code compatible with other C64x+
and C67x+ devices
• L1 Memory
– Can be partitioned as cache and/or
RAM
– 32KB L1P per core
– 32KB L1D per core
– Error detection for L1P
– Memory protection
• Dedicated L2 Memory
– Can be partitioned as cache and/or
RAM
– 512 KB to 1 MB Local L2 per core
– Error detection and correction for all
L2 memory
• Direct connection to memory subsystem
C66x CorePac Architecture
[Diagram: C66x CorePac. The DSP core (units .M, .L, .S, and .D on each side; register files A and B, 32 registers each) fetches 256-bit instruction packets from single-cycle Level 1 Program memory (L1P, cache/RAM) and accesses single-cycle Level 1 Data memory (L1D, cache/RAM) over 64-bit paths. A memory controller connects both to Level 2 memory (L2, program/data cache/RAM).]
CorePac includes:
• DSP Core
• Two register files (A and B), 32 registers each
• Four functional units per register side
• L1P memory (Cache/RAM)
• L1D memory (Cache/RAM)
• L2 memory (Cache/RAM)
C66x DSP Core
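[Diagram: C66x DSP core. Register file A (A0 to A31) serves units .D1, .S1, .M1, and .L1; register file B (B0 to B31) serves units .D2, .S2, .M2, and .L2. The .M units provide the MACs, the .D units connect to memory, and a controller/decoder issues the instructions.]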
• Four functional units per side:
– Multiplier (.M)
– ALU (.L)
– Data (.D)
– Control (.S)
• These independent functional units enable efficient execution of parallel specialized instructions:
– Multiplier (.M1 and .M2) and ALU (.L1 and .L2) provide MAC (multiply-accumulate) operations.
– Data (.D) provides data input/output.
– Control (.S) provides control functions (loop, branch, call).
• Each DSP core dispatches up to eight parallel instructions each cycle.
• All instructions are conditional, which enables efficient pipelining.
• The optimized C compiler generates efficient target code.
C66x DSP Core Cross-Path
[Diagram: Register File A (A0 to A31) with units .D1, .S1, .M1, and .L1, and Register File B (B0 to B31) with units .D2, .S2, .M2, and .L2, linked by cross paths.]
Any 64-bit pair of registers from A can be one of the inputs to a B functional unit, and vice versa.
Partial List of .D Instructions
Partial List of .L Instructions
Partial List of .M Instructions
Partial List of .S Instructions
C66x CorePac Improvements Over C64x+
• Wider internal bus
– 64 bit for the .L and .S functional units
– 128 bit for the .M functional unit
• Wider cross path
– 64 bit for each direction
• 4x number of multipliers
– More SIMD instructions
• Enhanced instruction set
– More than 100 new instructions added (compared to C64x+)
Enhanced C66x Instruction Set
• New SIMD instructions:
– QMPY32: 4-way SIMD of MPY32
– DDOTP4H: 2-way SIMD of DOTP4H
– DPACKL2: SIMD version of PACKL2
– DAVGU4: Average of 8 packed unsigned bytes
• New floating-point instructions:
– MPYDP: Double-precision multiplication
– FMPYDP: Fast double-precision multiplication
– DINTSP: 2-way SIMD conversion of 32-bit signed integers to single-precision floating point
Interesting New C66x Instructions
• MFENCE (Memory Fence): stalls the instruction pipeline until all outstanding memory operations have completed.
• RCPSP (Single-Precision Floating-Point
Reciprocal Approximation)
• RSQRSP (Single-Precision Floating-Point
Square-Root Reciprocal Approximation)
C66x SIMD Instructions: Examples
• ADDDP – Add Two Double-Precision Floating-Point Values
• DADD2 – 4-Way SIMD Addition, Packed Signed 16-bit
– Performs 4 additions of two sets of 4 16-bit numbers packed into 64-bit registers.
– The 4 results are rounded to 4 packed 16-bit values.
– unit = .L1, .L2, .S1, .S2
• FMPYDP – Fast Double-Precision Floating-Point Multiply
• QMPY32 – 4-Way SIMD Multiply, Packed Signed 32-bit
– Performs 4 multiplications of two sets of 4 32-bit numbers packed into 128-bit registers.
– The 4 results are packed 32-bit values.
– unit = .M1 or .M2
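To make the idea of a 4-way packed 16-bit addition concrete, the following Python sketch models the lane-wise behavior described for DADD2. It is only an illustration: the helper names (`unpack16`, `pack16`, `dadd2_model`) are hypothetical, and wraparound on overflow is an assumption of this model, so the instruction set reference should be consulted for the exact overflow behavior.

```python
def unpack16(word64):
    """Split a 64-bit integer into four signed 16-bit lanes (lane 0 = least significant)."""
    lanes = []
    for i in range(4):
        value = (word64 >> (16 * i)) & 0xFFFF
        # Reinterpret the 16-bit pattern as a signed value.
        lanes.append(value - 0x10000 if value & 0x8000 else value)
    return lanes


def pack16(lanes):
    """Pack four signed 16-bit values into one 64-bit integer."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFFFF) << (16 * i)
    return word


def dadd2_model(a, b):
    """Model of a 4-way SIMD add: each 16-bit lane is added independently.

    Assumption: a lane that overflows wraps around (pack16 masks to 16 bits).
    """
    return pack16([x + y for x, y in zip(unpack16(a), unpack16(b))])


a = pack16([1, -2, 300, 4000])
b = pack16([10, 20, -30, 40])
print(unpack16(dadd2_model(a, b)))  # [11, 18, 270, 4040]
```

One 64-bit operation replaces four scalar additions, which is where the SIMD throughput gain comes from.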
C66x SIMD Instruction: CMATMPY
Many applications use complex matrix arithmetic.
• CMATMPY – 1x2 Complex Vector Multiply 2x2 Complex Matrix
– Results in 1x2 signed complex vector.
– All values are 16-bit (16-bit real/16-bit imaginary)
– unit = .M1 or .M2
• How many real multiplications per second does this provide?
– 4 complex multiplications (4 real multiplications each) = 16 multiplications per instruction
– Two .M units (16 multiplications each) = 32 multiplications per cycle
– Core cycles per second: 1.25 G
– Total multiplications per second = 40 G multiplications
– 8 cores = 320 G multiplications per second
The question is whether we can feed the functional units with data fast enough.
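The arithmetic on this slide can be checked with a short Python sketch. The function `cmatmpy_model` is a hypothetical floating-point model of the vector-by-matrix product (the hardware works on packed 16-bit fixed-point values); the constants are taken from the slide.

```python
def cmatmpy_model(v, m):
    """Multiply a 1x2 complex vector by a 2x2 complex matrix.

    Uses four complex multiplications, each of which costs four real
    multiplications when expanded.
    """
    return [
        v[0] * m[0][0] + v[1] * m[1][0],
        v[0] * m[0][1] + v[1] * m[1][1],
    ]


REAL_MULTS_PER_COMPLEX_MULT = 4
COMPLEX_MULTS_PER_CMATMPY = 4
M_UNITS_PER_CORE = 2
CORE_CLOCK_HZ = 1_250_000_000
CORES = 8

mults_per_instruction = COMPLEX_MULTS_PER_CMATMPY * REAL_MULTS_PER_COMPLEX_MULT
mults_per_cycle = mults_per_instruction * M_UNITS_PER_CORE
mults_per_second_per_core = mults_per_cycle * CORE_CLOCK_HZ
mults_per_second_per_device = mults_per_second_per_core * CORES

print(cmatmpy_model([1 + 2j, 3 - 1j], [[1, 1j], [2, 0]]))  # [(7+0j), (-2+1j)]
print(mults_per_instruction)                # 16
print(mults_per_cycle)                      # 32
print(mults_per_second_per_core / 1e9)      # 40.0
print(mults_per_second_per_device / 1e9)    # 320.0
```

These are peak figures: they assume both .M units issue a CMATMPY on every cycle.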
Feeding the Functional Units
There are two challenges:
• How to provide enough data from memory to the core
– Access to L1 memory is wide (2 x 64 bit) and fast (0 wait state).
– Multiple mechanisms are used to efficiently transfer new data to L1 from L2 and external memory.
• How to get values in and out of the functional units
– Hardware pipeline enables execution of instructions every cycle.
– Software pipeline enables efficient instruction scheduling to maximize functional unit throughput.
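A back-of-the-envelope calculation shows why feeding the units is a real concern. The sketch below assumes both 64-bit L1 paths deliver data on every cycle at the maximum 1.25 GHz clock, and that each CMATMPY consumes a 1x2 vector and a 2x2 matrix of 16-bit complex values. The variable names are illustrative only.

```python
BYTES_PER_COMPLEX16 = 4          # 16-bit real + 16-bit imaginary
L1_PATHS = 2                     # two data paths to L1
L1_PATH_WIDTH_BYTES = 8          # 64 bits each
CORE_CLOCK_HZ = 1_250_000_000

# Peak L1 data bandwidth per core.
load_bytes_per_cycle = L1_PATHS * L1_PATH_WIDTH_BYTES
load_bytes_per_second = load_bytes_per_cycle * CORE_CLOCK_HZ

# Operands consumed if both .M units execute CMATMPY every cycle:
# a 1x2 vector (2 values) plus a 2x2 matrix (4 values) per instruction.
operand_bytes_per_cmatmpy = (2 + 4) * BYTES_PER_COMPLEX16
operand_bytes_per_cycle = 2 * operand_bytes_per_cmatmpy

reuse_factor = operand_bytes_per_cycle / load_bytes_per_cycle

print(load_bytes_per_cycle)           # 16
print(load_bytes_per_second / 1e9)    # 20.0 (GB/s)
print(operand_bytes_per_cycle)        # 48
print(reuse_factor)                   # 3.0
```

Under these assumptions the multipliers can consume about three times more operand data than L1 can deliver per cycle, so sustained peak throughput depends on reusing operands already held in registers (for example, keeping the matrix resident across several vectors).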
Internal Buses
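[Diagram: internal buses linking the core to the L1 memories, L2 and external memory, and peripherals. Program fetch uses a 32-bit program address (PC) and a 256-bit program data bus. Data paths T1 and T2 each have a 32-bit data address bus and a 32/64-bit data bus, serving register files A and B.]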
C62x: Dual 32-Bit Load/Store
C67x: Dual 64-Bit Load / 32-Bit Store
C64x, C674x, C66x: Dual 64-Bit Load/Store
Memory Subsystem
• Multicore Shared Memory (MSM SRAM)
– 2 to 4 MB
– Available to all cores
– Can contain program and data
• Multicore Shared Memory Controller (MSMC)
– Arbitrates access of CorePac and SoC masters to shared memory
– Provides a connection to the DDR3 EMIF
– Provides CorePac access to coprocessors and I/O peripherals
– Provides error detection and correction for all shared memory
– Memory protection and address extension to 64 GB (36 bits)
– Provides multi-stream pre-fetching capability
• DDR3 External Memory Interface (EMIF)
– Support for 16-bit, 32-bit, and 64-bit modes
– Specified at up to 1600 MT/s
– Supports power down of unused pins when using 16-bit or 32-bit width
– Support for 8 GB memory address
– Error detection and correction
Multicore Navigator
• Provides seamless inter-core
communications (messages and data
exchanges) between cores, IP, and
peripherals … “Fire and forget”
• Low-overhead processing and routing
of packet traffic to and from
peripherals and cores
• Supports dynamic load optimization
• Data transfer architecture designed to
minimize host interaction while
maximizing memory and bus
efficiency
• Consists of a Queue Manager
Subsystem (QMSS) and multiple,
dedicated Packet DMA engines
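The "fire and forget" idea can be illustrated with a toy software model: a producer pushes a descriptor onto a numbered queue and carries on, while a consumer pops descriptors later, in order. Everything here is hypothetical (`QueueManagerModel`, its methods, the queue number, and the descriptor fields are invented for illustration and are not a TI API); the real Queue Manager is a hardware block.

```python
from collections import deque


class QueueManagerModel:
    """Toy model of a bank of FIFO queues holding descriptors."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]

    def push(self, queue_id, descriptor):
        # The producer does not wait for the consumer: fire and forget.
        self.queues[queue_id].append(descriptor)

    def pop(self, queue_id):
        # Returns None when the queue is empty.
        queue = self.queues[queue_id]
        return queue.popleft() if queue else None


qm = QueueManagerModel(num_queues=8)
TX_QUEUE = 3  # hypothetical queue number

# Producer side (for example, one core).
qm.push(TX_QUEUE, {"buffer": b"first", "length": 5})
qm.push(TX_QUEUE, {"buffer": b"second", "length": 6})

# Consumer side (for example, another core or a Packet DMA engine).
received = []
descriptor = qm.pop(TX_QUEUE)
while descriptor is not None:
    received.append(descriptor["buffer"])
    descriptor = qm.pop(TX_QUEUE)

print(received)  # [b'first', b'second']
```

The point of the design is that producer and consumer only share a queue number, which keeps host involvement low.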
Network Coprocessor
• Provides hardware accelerators to perform L2, L3, and L4 processing and encryption that was previously done in software
• Packet Accelerator (PA)
– 8K multiple-in, multiple-out HW queues
– Single IP address option
– UDP (and TCP) checksum and selected CRCs
– L2/L3/L4 support
– Quality of Service (QoS)
– Multicast to multiple queues
– Timestamps
• Security Accelerator (SA)
– Hardware encryption, decryption, and authentication
– Supports IPsec ESP, IPsec AH, SRTP, and 3GPP protocols
External Interfaces
• 2x SGMII ports support 10/100/1000 Ethernet
• 4x high-bandwidth Serial RapidIO (SRIO) lanes for inter-DSP applications
• SPI for boot operations
• UART for development/testing
• 2x PCIe at 5 Gbps
• I2C for EPROM at 400 Kbps
• 16x GPIO pins
• Application-specific interfaces
TeraNet Switch Fabric
• A non-blocking switch fabric that
enables fast and contention-free
internal data movement
• Provides a configured way –
within hardware – to manage
traffic queues and ensure priority
jobs are getting accomplished
while minimizing the involvement
of the CorePac cores
• Facilitates high-bandwidth
communications between
CorePac cores, subsystems,
peripherals, and memory
Diagnostic Enhancements
• Embedded Trace Buffers (ETB)
enhance the diagnostic capabilities of
the CorePac.
• CP Monitor enables diagnostic
capabilities on data traffic through the
TeraNet switch fabric.
• Automatic statistics collection and
exporting (non-intrusive)
• Monitors individual events for better
debugging
• Monitors transactions to both memory end points and Memory-Mapped Registers (MMR)
• Configurable monitor-filtering
capability based on address and
transaction type
HyperLink Bus
• Provides the capability to expand
the device to include hardware
acceleration or other auxiliary
processors
• Supports four lanes with up to
12.5 Gbaud per lane
Miscellaneous Elements
• Boot ROM
• Semaphore module provides atomic
access to shared chip-level resources.
• Power management
• Three on-chip PLLs:
– PLL1 for CorePacs (and all
modules except DDR3 and PA)
– PLL2 for DDR3
– PLL3 for Packet Acceleration
• Three EDMA controllers
• Eight 64-bit timers
• Inter-Processor Communication (IPC)
registers
App-Specific: Wireless Applications
C6670
[Block diagram: C6670 with 4 cores @ 1.0 GHz / 1.2 GHz (32KB L1P, 32KB L1D, and 1024KB L2 cache/RAM per core), 2MB MSM SRAM, 64-bit DDR3 EMIF, wireless coprocessors (RSA, VCP2, TCP3d, TCP3e, FFTC, BCP), and the AIF2 antenna interface, alongside the common KeyStone blocks.]
• Wireless-specific coprocessors:
– 2x FFT Coprocessor (FFTC)
– Turbo Decoder/Encoder
Coprocessor (TCP3D/3E)
– 4x Viterbi Coprocessor (VCP2)
– Bit-rate Coprocessor (BCP)
– 2x Rake Search Accelerator (RSA)
• Wireless-specific interface:
– 6x Antenna Interface 2 (AIF2)
App-Specific: General Purpose
C6671/C6672/C6674/C6678
[Block diagram: 1 to 8 cores @ up to 1.25 GHz (32KB L1P, 32KB L1D, and 512KB L2 cache/RAM per core), 4MB MSM SRAM, 64-bit DDR3 EMIF, 2x TSIP, and EMIF 16, alongside the common KeyStone blocks.]
General purpose application interfaces:
• 2x Telecommunications Serial Port (TSIP)
• EMIF 16 (EMIF-A):
– Connects memory up to 256 MB
– Three modes:
• Synchronized SRAM
• NAND flash
• NOR flash
Low-Power Low-Cost
KeyStone C665x Sub-family
KeyStone C6655/57: Device Features
[Block diagram: C6655/57 with 1 or 2 cores @ up to 1.25 GHz (2nd core on C6657 only), each with 32KB L1P cache, 32KB L1D cache, and 1024KB L2 cache; 1MB MSM SRAM; 32-bit DDR3 EMIF; SmartReflex enabled; 40 nm high-performance process.]
C66x CorePac
– C6655: One C66x CorePac DSP Core at 1.0 or 1.25 GHz
– C6657: Two C66x CorePac DSP Cores at 0.85, 1.0, or 1.25 GHz
– Fixed and Floating Point Operations
– Backward-compatible with C64x+ and C67x+ cores
Memory Subsystem
– 1 MB Local L2 memory per core
– Multicore Shared Memory Controller (MSMC)
– 32-bit DDR3 Interface
Hardware Coprocessors
– Turbo Coprocessor Decoder (TCP3d)
– 2x Viterbi Coprocessors (VCP2)
Multicore Navigator
– Queue Manager (8192 hardware queues)
– Packet-based DMA
Interfaces
– High-speed HyperLink bus
– One 10/100/1000 Ethernet SGMII port
– 4x Serial RapidIO (SRIO) Rev 2.1
– 2x PCIe Gen2
– 2x Multichannel Buffered Serial Ports (McBSP)
– One Asynchronous Memory Interface (EMIF16)
– Additional Serials: SPI, I2C, UPP, GPIO, UART
KeyStone C6654: Power Optimized
C66x CorePac
– C6654: One CorePac DSP Core at 850 MHz
– Fixed and Floating Point Operations
– Backward compatible with C64x+ and C67x+ cores
Memory Subsystem
– 1 MB Local L2 memory
– Multicore Shared Memory Controller (MSMC)
– 32-bit DDR3 Interface
Multicore Navigator
– Queue Manager (8192 hardware queues)
– Packet-based DMA
Interfaces
– One 10/100/1000 Ethernet SGMII port
– 2x PCIe Gen2
– 2x Multichannel Buffered Serial Ports (McBSP)
– One Asynchronous Memory Interface (EMIF16)
– Additional Serials: SPI, I2C, UPP, GPIO, UART
[Block diagram: C6654 with 1 core @ 850 MHz (32KB L1P cache, 32KB L1D cache, and 1024KB L2 cache); 32-bit DDR3 EMIF; SmartReflex enabled; 40 nm high-performance process.]
KeyStone C665x: Key HW Variations
HW Feature                           C6654    C6655            C6657
CorePac Frequency (GHz)              0.85     1 @ 1.0, 1.25    2 @ 0.85, 1.0, 1.25
Multicore Shared Memory (MSM)        No       1024KB SRAM      1024KB SRAM
DDR3 Maximum Data Rate (MT/s)        1066     1333             1333
Serial Rapid I/O Lanes               No       4x               4x
HyperLink                            No       Yes              Yes
Viterbi Coprocessor (VCP)            No       2x               2x
Turbo Coprocessor Decoder (TCP3d)    No       Yes              Yes
For More Information
• For more information, refer to the
C66x Getting Started page to locate the data
manual for your KeyStone device.
• View the complete C66x Multicore SOC Online
Training for KeyStone Devices, including
details on the individual modules.
• For questions regarding topics covered in this
training, visit the support forums at the
TI E2E Community website.
Additional Information
Memory Subsystem – Additional Information
Memory subsystem provides:
• Address extension/translation
• Memory protection for addresses outside C66x
• Shared memory access path
• Cache and pre-fetch support
Two Register Sets:
• MPAX registers – Memory Protection and Extension Registers (16)
• MAR registers – Memory Attributes registers (256)
Each CorePac has its own set of MPAX and MAR registers!
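Address extension can be pictured as a per-core table of segments, each mapping a window of the 32-bit logical address space onto a 36-bit physical address. The Python sketch below is a simplified model only: the `Segment` fields, the `translate` function, the first-match rule, and the example mapping are assumptions for illustration, and it ignores the permission bits and the real register layout.

```python
from typing import NamedTuple, Optional, Sequence


class Segment(NamedTuple):
    base: int         # start of the window in the 32-bit logical space
    size: int         # window size in bytes
    replacement: int  # start of the window in the 36-bit physical space


def translate(addr32: int, segments: Sequence[Segment]) -> Optional[int]:
    """Return the 36-bit physical address, or None if no segment matches.

    Assumption of this model: the first matching segment wins.
    """
    for seg in segments:
        if seg.base <= addr32 < seg.base + seg.size:
            return seg.replacement + (addr32 - seg.base)
    return None


# Example mapping (illustrative values): a 2 GB logical window placed
# above the 4 GB boundary of the physical address space.
segments = [Segment(base=0x8000_0000, size=0x8000_0000, replacement=0x8_0000_0000)]

print(hex(translate(0x8000_1000, segments)))  # 0x800001000
print(translate(0x0000_1000, segments))       # None
print(2**36 // 2**30)                         # 64 (GB reachable with 36 bits)
```

Because each CorePac has its own register set, two cores can map the same logical address onto different physical memory.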
Multicore Navigator - Additional Information
[Block diagram: Multicore Navigator. The host (application software) holds link RAM, descriptor RAMs, buffer memory, and accumulation memory in L2 or DDR, and reaches the hardware over VBUS through the Queue Manager register interface, the PKTDMA register interface, and the accumulator command interface. The QMSS contains the Queue Manager, internal link RAM, config RAM, a register interface, two APDSPs (accumulator and monitor) with timers, an interrupt distributor that raises queue interrupts, and an internal PKTDMA. Each PKTDMA contains Rx and Tx cores, Rx and Tx channel control/FIFOs, a Tx DMA scheduler, an Rx coherency unit, config RAM, and a register interface. It connects to its hardware block through Rx (ingress) and Tx (egress) streaming interfaces, with a Tx scheduling interface for AIF2 only, and is triggered by queue pend signals from the Queue Manager.]
Network Coprocessor (Logical) – Additional Information
[Diagram: logical view of the Network Coprocessor, built from PKTDMA queues and QMSS FIFO queues. Ingress path: Ethernet RX MAC and SRIO message RX, RX PKTDMA, Packet Accelerator stages (Classify Pass 1, Classify Pass 2, Modify) with a lookup engine (16 IPsec, 32 IP, and 16 Ethernet entries), and the Security Accelerator (cp_ace). Egress path: TX PKTDMA, Modify, Ethernet TX MAC, and SRIO message TX. The CorePacs (DSP 0 and the other cores) sit at the end of the queues.]
External Interfaces - Additional Information
Common Interfaces
• One PCI Express (PCIe) Gen II port
– Two lanes running at 5 GBaud
– Support for root complex (host) mode and end point mode
– Single Virtual Channel (VC) and up to eight Traffic Classes (TC)
– Hot plug
• Universal Asynchronous Receiver/Transmitter (UART)
– 2.4, 4.8, 9.6, 19.2, 38.4, 56, and 128 K baud rate
• Two SGMII ports with embedded switch
– Supports IEEE1588 timing over Ethernet
– Supports 1G/100 Mbps full duplex
– Supports 10/100 Mbps half duplex
– Inter-working with RapidIO message
– Integrated with packet accelerator for efficient IPv6 support
– Supports jumbo packets (9 KB)
– Three-port embedded Ethernet switch with packet forwarding
– Reset isolation with SGMII ports and embedded ETH switch
• Serial Port Interface (SPI)
– Operates at up to 66 MHz
– Two chip selects
– Master mode
• Inter IC Control Module (I2C)
– One for connecting EPROM (up to 4Mbit)
– 400 Kbps throughput
– Full 7-bit address field
• General Purpose IO (GPIO) module
– 16-bit operation
– Can be configured as interrupt pin
– Interrupt can select either rising edge or falling edge
Application-Specific Interfaces
For Wireless Applications
• Antenna Interface 2 (AIF2)
– Multiple-standard support (WCDMA, LTE, WiMAX, GSM/Edge)
– Generic packet interface (~12Gbits/sec ingress & egress)
– Frame Sync module (adapted for WiMAX, LTE & GSM slots/frames/symbols boundaries)
– Reset isolation
• Serial RapidIO (SRIO)
– RapidIO 2.1 compliant
– Four lanes @ 5 Gbps (1.25/2.5/3.125/5 Gbps operation per lane; configurable as four 1x, two 2x, or one 4x)
– Direct I/O and message passing (VBUSM slave)
– Packet forwarding
– Improved support for dual-ring daisy-chain
– Reset isolation
– Upgrades for inter-operation with packet accelerator
For Media Gateway Applications
• Telecommunications Serial Port (TSIP)
– Two TSIP ports for interfacing TDM applications
– Supports 2/4/8 lanes at 32.768/16.384/8.192 Mbps per lane & up to 1024 DS0s
• EMIF 16 (256MB)
– NAND
– NOR
– Synchronized SRAM
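The TSIP figures quoted above are consistent with each other, which a short Python check makes visible. A DS0 is a 64 kbps channel; the sketch assumes the whole lane bandwidth is divided into DS0 timeslots, and the variable names are illustrative.

```python
DS0_KBPS = 64  # one DS0 channel carries 64 kbps

# (lanes, rate per lane in kbps) for the three supported configurations.
configurations = [(2, 32_768), (4, 16_384), (8, 8_192)]

ds0_counts = []
for lanes, lane_rate_kbps in configurations:
    timeslots_per_lane = lane_rate_kbps // DS0_KBPS
    ds0_counts.append(lanes * timeslots_per_lane)
    print(lanes, "lanes x", timeslots_per_lane, "timeslots =", lanes * timeslots_per_lane)

# Every configuration carries the same aggregate of 1024 DS0 channels.
print(ds0_counts)  # [1024, 1024, 1024]
```

Halving the number of lanes while doubling the lane rate keeps the aggregate capacity constant.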
Serial RapidIO - Additional Information
• SRIO or RapidIO provides a 3-layered architecture
– Physical defines electrical characteristics, link flow control (CRC)
– Transport defines addressing scheme (8b/16b device IDs)
– Logical defines packet format and operational protocol
• Two basic modes of logical layer operation
– DirectIO
• Transmit device needs knowledge of memory map of receiving device
• Includes NREAD, NWRITE_R, NWRITE, SWRITE
• Functional units: LSU, MAU, AMU
– Message Passing
• Transmit device does not need knowledge of memory map of receiving device
• Includes Type 11 messages and Type 9 packets
• Functional units: TXU, RXU
• Gen 2 Implementation – Supporting up to 5 Gbps
TeraNet - Additional Information
[Diagram: TeraNet switch fabric with master (M) and slave (S) ports. A 256-bit TeraNet running at CPUCLK/2 carries EDMA_0 (TPCC with TC0 and TC1), HyperLink, SRIO, the cores (XMC, L2 0-3), MSMC, shared L2, and DDR3. A 128-bit TeraNet running at CPUCLK/3 carries EDMA_1,2 (TPCCs with TC2 to TC9), the Network Coprocessor, QMSS, PCIe, SRIO, DebugSS, and the wireless blocks (FFTC, AIF, RAC, TAC, TCP3d, TCP3e, VCP2 x4) with their packet DMAs.]
• Facilitates high-bandwidth
communication links between
DSP cores, subsystems,
peripherals, and memories.
• Supports parallel orthogonal
communication links
Debug – Additional Information
• Multicore emulation support: host tooling can halt any or all of the cores on the device.
– Each core supports a direct connection to the JTAG interface.
– Emulation has full visibility of the CorePac memory map.
• Adds third mode of running, halt, in response to “critical”
interrupts
• Supports core and system trace into different trace buffers (4K, 32K) or an external receiver (up to 2G on XDS560v2 Pro)
• Ability to dynamically drain trace buffers from the application
• Advanced Event Triggering (AET) allows the user to identify and
trigger on events of interest from the code or the debugger.
• Common Platform Trace (CP Tracer) provides statistical gathering
into trace buffer for various slave interfaces. Enables profiling,
identification of bottlenecks, and instrumentation.
Miscellaneous Elements – Additional Information
• Support to assert NMI (Non-maskable
Interrupt) input for each core; Separate
hardware pins for NMI and core selector.
• Support for local reset for each core;
Separate hardware pins for local reset and
core selector.
EDMA – Additional Information
Three EDMA Channel Controllers:
• One controller in CPU/2 domain:
– Two transfer controllers/queues with 1KB channel buffer
– Eight QDMA channels
– 16 interrupt channels
– 128 PaRAM entries
• Two controllers in CPU/3 domain, each including the following:
– Four transfer controllers/queues with 1KB or 512B channel buffer
– Eight QDMA channels
– 64 interrupt channels
– 512 PaRAM entries
• Interrupt generation
– Transfer completion
– Error conditions
FFT Coprocessor (FFTC) - Additional Information
• The FFTC has been designed to be compatible with various OFDM-based wireless standards like WiMax and LTE up to 8192 16-bit I/Q.
• Packet DMA (PKTDMA) is used to move data in and out of the FFTC module.
• The FFTC supports four input (Tx) queues that are serviced in a round-robin fashion.
• LTE 7.5 kHz frequency shift
• Dynamic and programmable scaling modes
– Dynamic scaling mode returns block exponent
• Support for left-right FFT shift (switch the left/right halves)
• Support for variable FFT shift
– For OFDM (Orthogonal Frequency Division Multiplexing) downlink, supports data format with DC subcarrier in the middle of the subcarriers
• Support for cyclic prefix
– Addition and removal
– Any length supported
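Two of the data-formatting features above, the left-right FFT shift and cyclic prefix handling, are easy to show in software. The Python sketch below only illustrates what these operations do to a block of samples; the function names are hypothetical and it says nothing about how the FFTC is programmed.

```python
def fft_shift(bins):
    """Swap the left and right halves of a block (even length assumed)."""
    half = len(bins) // 2
    return bins[half:] + bins[:half]


def add_cyclic_prefix(symbol, cp_len):
    """Prepend a copy of the last cp_len samples of the symbol."""
    if cp_len == 0:
        return list(symbol)
    return symbol[-cp_len:] + symbol


def remove_cyclic_prefix(samples, cp_len):
    """Drop the first cp_len samples, recovering the original symbol."""
    return samples[cp_len:]


bins = [0, 1, 2, 3, 4, 5, 6, 7]
print(fft_shift(bins))                 # [4, 5, 6, 7, 0, 1, 2, 3]

symbol = [1, 2, 3, 4, 5, 6, 7, 8]
with_cp = add_cyclic_prefix(symbol, 2)
print(with_cp)                         # [7, 8, 1, 2, 3, 4, 5, 6, 7, 8]
print(remove_cyclic_prefix(with_cp, 2) == symbol)  # True
```

Applying the shift twice returns the original order, which is why the same operation serves both directions.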
Turbo CoProcessor 3 Decoder (TCP3D)
Additional Information
• Programmable peripheral for decoding of 3GPP (WCDMA, HSUPA, HSUPA+,
TD_SCDMA), LTE, and WiMax turbo codes.
[Diagram: LTE bit processing. Per transport block: soft bits and de-scrambling. Per code block: channel de-interleaver, LLR combining, and de-rate matching, feeding LLR data (systematic, parity 0, parity 1) to the TCP3D, which produces hard-decision decoded bits for the TB CRC.]
Turbo CoProcessor 3 Encoder (TCP3E) –
Additional Information
• TCP3E = Turbo CoProcessor 3 Encoder
• 3GPP, WiMAX and LTE encoding
– 3GPP includes: WCDMA, HSDPA, and TD-SCDMA
– There are no previous versions of the encoder; it was released at the same time as the third version of the decoder coprocessor (TCP3D)
– Performs Turbo Encoding for forward error correction of
transmitted information (downlink for basestation), adds
redundant data to transmitted message
[Diagram: the Turbo Encoder (TCP3E) sends data over the downlink to the turbo decoder in the handset.]
Bit Rate Coprocessor (BCP) – Additional
Information
• The Bit Rate Coprocessor (BCP) is a programmable peripheral for baseband
bit processing.
• Integrated into the TI DSP, the BCP supports FDD LTE, TDD LTE, WCDMA,
TD-SCDMA, HSPA, HSPA+, WiMAX 802.16-2009 (802.16e), and
monitoring/planning for LTE-A.
• Primary functionalities of the BCP peripheral include the following:
– CRC
– Turbo / convolutional encoding
– Rate matching (hard and soft) / rate de-matching
– LLR combining
– Modulation (hard and soft)
– Interleaving / de-interleaving
– Scrambling / de-scrambling
– Correlation (final de-spreading for WCDMA RX and PUCCH correlation)
– Soft slicing (soft demodulation)
– 128-bit Navigator interface
– Two 128-bit direct I/O interfaces
– Runs in parallel with DSP
– Internal debug logging
Viterbi Decoder Coprocessor (VCP2) –
Additional Information
• Variable constraint length, K=5,6,7,8, or 9
• User-supplied code coefficients
• 1/2, 1/3 or 1/4 code rate
• Configurable trace back settings (convergence distance, frame structure)
• Branch metrics calculations and de-puncturing done in software by DSP
• Communication to and from cores is done using EDMA3
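To make constraint length, code coefficients, and code rate concrete, the sketch below shows the encoder side of a convolutional code, that is, the kind of bit stream a Viterbi decoder such as the VCP2 recovers. It is an illustration in Python, not VCP2 code: `conv_encode` is a hypothetical helper, and the K=7 polynomials 171 and 133 (octal) are a widely used rate-1/2 pair chosen as an example of user-supplied coefficients.

```python
def conv_encode(bits, polys, k):
    """Encode bits with a rate 1/len(polys) convolutional code.

    k is the constraint length; each polynomial is a k-bit mask selecting
    which of the current and previous k-1 input bits are XORed together.
    """
    state = 0  # the previous k-1 input bits
    out = []
    for bit in bits:
        register = (bit << (k - 1)) | state  # newest bit in the top position
        for poly in polys:
            # Output bit = parity of the register bits selected by poly.
            out.append(bin(register & poly).count("1") & 1)
        state = register >> 1  # shift: drop the oldest bit
    return out


K = 7
POLYS = (0o171, 0o133)  # example coefficients for a rate-1/2 code

message = [1, 0, 1, 1]
encoded = conv_encode(message, POLYS, K)
print(len(encoded))                    # 8: two output bits per input bit
print(conv_encode([1], POLYS, K))      # [1, 1]
```

With two polynomials the code rate is 1/2; supplying three or four polynomials gives the 1/3 and 1/4 rates listed above.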