
On-Chip Communication
(Architecture and Design)
Sungjoo Yoo
ISRC, SNU
Contents
 Part 1
Introduction to on-chip communication
On-chip communication architecture
Software architecture
Hardware architecture
On-chip communication networks
 Part 2
Analysis and optimization of on-chip communication
network
On-chip communication design on unreliable
interconnect
Open issues and summary
Part 1
Introduction
On-chip communication design
High-level functional
specification
SoC Implementation of
on-chip communication
architecture
[Figure: modules M1, M2, M3; µP and IP; SW wrapper (SW wr.) and HW wrappers (HW wr.); physical communication network]
Designer’s Objectives and
Problems
 High-performance
What is the maximum bandwidth of wire?
What is the best suited OCA?
 Low power consumption
What is the minimum energy required to send the
given amount of data?
How to achieve the minimum energy?
 Small HW/SW overhead
Interconnection and transceiver
 Conflicting objectives
Trade-offs
Incremental Refinement of
On-Chip Communication
Specification of On-Chip
Communication
Abstraction levels of on-chip
communication
Client/server level
Message level
Transaction level
Implementation level
Client/Server Level
 Concept
Service request/provide relation
A client component demands a service from server(s).
Service provider component may not be fixed and can be
determined dynamically
Object request broker (ORB) is needed.
 Real example
Modem service
PDA device: baseband modem <--> vocoder
Modem service can be Bluetooth, IEEE802.11, CDMA2000,
GPS, etc. depending on the location of PDA device.
Indoor: Bluetooth or IEEE802.11
Outdoor: IEEE802.11 (short range) or CDMA2000
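The client/server relation on this slide can be sketched as a toy broker in Python. This is a minimal sketch, not an actual ORB: the `ServiceBroker` class, its method names, and the rule of picking the first registered provider are assumptions made for illustration. Only the indoor/outdoor service names come from the example.

```python
class ServiceBroker:
    """Toy stand-in for an object request broker (hypothetical API)."""

    def __init__(self):
        # Maps (service name, location) to an ordered list of provider names.
        self._providers = {}

    def register(self, service, location, provider):
        self._providers.setdefault((service, location), []).append(provider)

    def resolve(self, service, location):
        """Return a provider for the service at the current location."""
        candidates = self._providers.get((service, location), [])
        if not candidates:
            raise LookupError(f"no provider for {service!r} at {location!r}")
        # Assumption: take the first registered candidate.
        return candidates[0]


broker = ServiceBroker()
broker.register("modem", "indoor", "Bluetooth")
broker.register("modem", "indoor", "IEEE802.11")
broker.register("modem", "outdoor", "IEEE802.11")
broker.register("modem", "outdoor", "CDMA2000")

# The vocoder (client) asks for a modem service without naming the provider.
indoor_modem = broker.resolve("modem", "indoor")
outdoor_modem = broker.resolve("modem", "outdoor")
```

The client never names a concrete modem; the provider is chosen at resolve time, which is the point of this abstraction level.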
Message Level
Concept
Components communicate with each other
via messages.
Message sender/receiver are fixed.
A message can have any type of data.
Real example
PDA: In the CDMA2000 mode, the vocoder
sends messages to the CDMA2000 modem.
A message has a frame of voice data and
control info.
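A message with a fixed sender and receiver might be modeled as follows. This is only a sketch: the `Message` fields, the 20-byte frame size, the `seq` control field, and the list used as a channel are all assumptions for illustration, not part of any CDMA2000 interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str      # fixed at this level
    receiver: str    # fixed at this level
    payload: bytes   # a frame of voice data
    control: dict    # control info; field names are illustrative


def send(channel, msg):
    """Append the message to a FIFO channel (a plain list here)."""
    channel.append(msg)


def receive(channel):
    """Remove and return the oldest message."""
    return channel.pop(0)


channel = []
frame = bytes(20)  # assumption: a 20-byte voice frame
send(channel, Message("vocoder", "cdma2000_modem", frame, {"seq": 0}))
received = receive(channel)
```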
Transaction Level
 Concept
Components are mapped on real processors.
Communication is mapped on abstract communication
networks.
Communication protocols are fixed.
Transaction can be read, write, burst_read,
burst_write, etc.
For each candidate of real communication networks, the
transaction performance can be analyzed.
 Real example
PDA: vocoder on a DSP, modem on an IP, candidate
communication networks (AMBA, Sonics, IBM, ...)
Determine bus priorities, packet priorities, TDMA slot
assignment, etc.
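Transaction-level analysis over candidate networks can be illustrated with a small cost model. The sketch below is hypothetical throughout: the `NetworkModel` fields, the cycle counts, and the names `net_a` and `net_b` are invented for illustration and do not describe AMBA, Sonics, or any IBM bus.

```python
from dataclasses import dataclass


@dataclass
class NetworkModel:
    name: str
    setup_cycles: int      # arbitration + address phase (assumed)
    cycles_per_word: int   # data phase cost per word (assumed)


def transaction_cycles(net, kind, words=1):
    """Cycle count of one transaction on an abstract network model."""
    if kind in ("read", "write"):
        # Single transfers pay the setup cost for every word.
        return words * (net.setup_cycles + net.cycles_per_word)
    if kind in ("burst_read", "burst_write"):
        # A burst pays the setup cost once.
        return net.setup_cycles + words * net.cycles_per_word
    raise ValueError(f"unknown transaction kind: {kind}")


candidates = [
    NetworkModel("net_a", setup_cycles=2, cycles_per_word=1),
    NetworkModel("net_b", setup_cycles=4, cycles_per_word=1),
]
burst_cost = {n.name: transaction_cycles(n, "burst_write", words=8) for n in candidates}
single_cost = {n.name: transaction_cycles(n, "write", words=8) for n in candidates}
```

With these assumed numbers, an 8-word burst costs 10 cycles on `net_a` and 12 on `net_b`, against 24 and 40 cycles for eight single writes.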
Implementation Level


On-chip communication architecture is implemented.
Software and hardware architecture
[Figure: software and hardware architecture.
SW architecture: application SW, middleware, OS, device drivers.
HW architecture: µP/DSP with local memory (w/ I/D caches) and DMA on the processor local bus; HW IP; memory; adapters.
Communication network (OCBs w/ bridges, Sonics, packet/circuit switch, etc.)]
On-Chip Communication
Architecture
Software
Middleware, OS, device driver and ISR,
memory instructions
Hardware
DMA, (bus) adapter, communication
network (OCBs and bridges, packet
network, etc.), memory
Software On-Chip
Communication Architecture
 Middleware: CORBA, COM+, JAVA, BREW
Service resolution
ORB implementation
Dynamic reconfiguration of services needs to be
supported.
802.11 baseband modem in HW -->
Bluetooth in SW
 Operating system
Communication services
pipe, shared memory, semaphore, mutex,
etc.
Supported as OS system calls
Software On-Chip
Communication Architecture
 Device driver and ISR
The device driver depends on OS and the processor
OS
• Preemptive or not, interrupt or not, synchronization
services (semaphore, lock var, …)
Processor
• Bus width, register set, exception behavior, etc.
 Memory instructions
Load/store, load multiple/store multiple instructions
Cache/virtual memory instructions in ARM v6 architecture
Hardware On-Chip
Communication Architecture
 DMA (Direct Memory Access)
Block size
 Adapter
Basic functionality: protocol conversion
E.g. VCI -- AMBA
Local communication architecture
Distributed bus arbitration/network routing: e.g. Sonics,
packet switch network
[Figure: local communication architectures. Labels: µP (with OS), IP, modules M1, M3, M4, adapters, IP(µP) adapter, AMBA, CoreConnect, channel adapters (Ch. adp)]
Hardware On-Chip
Communication Architecture
Communication network
On-chip bus
AMBA, CoreConnect, PI, etc.
Sonics µNetwork
On-chip communication network
Circuit switch
• Philips
Packet switch
• W. Dally (DAC01), Guerrier (DATE00)
Hardware On-Chip
Communication Architecture
 On-chip memory
Shared memory
E.g. external SDRAM in multimedia chips
Distributed memory w/ caches: e.g. Daytona architecture
Four 64-bit processing elements (PE’s)
Each PE
- 32-bit RISC with DSP enhancements
- 64-bit vector co-processor (four
MAC’s)
Split-transaction bus
- Shared memory based on L1 cache
snooping
- Caches reduce bus traffic.
Embedded RTOS dynamically schedules
tasks.
120 mm², 0.35 µm, 100 MHz
Hardware On-Chip
Communication Architecture
On-chip memory (cont’d)
On-chip implementation of linked list
Philips, DATE01
Data transfer and storage exploration
(DTSE)
IMEC
• Focus on low power consumption and area of
memory
On-Chip Communication
Networks
Routing
Sonics µNetwork SiliconBackplane
Philips, Circuit Switch Network
Packet Switch Networks, Guerrier, DATE00
Network topologies
Mesh, W. Dally, DAC2001
Octagon, ST Microelectronics, DAC2001
Sonics µNetwork
SiliconBackplane
On-chip bus
Time-division multiple access (TDMA)
Pre-characterized on-chip
bus agent
Two-step Arbitration
 Originally assigned module --> TDMA
If no bus access --> priority-based
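The two-step arbitration can be sketched as a function that grants one TDMA slot. This is a simplified illustration, not the Sonics implementation: the slot table, the fixed priority list, and the module names are assumptions.

```python
def arbitrate(slot_owner, requests, priority):
    """Pick the bus master for one TDMA slot.

    slot_owner: module the slot was originally assigned to
    requests:   set of modules requesting the bus in this slot
    priority:   list of modules, highest priority first (assumed scheme)
    Returns the granted module, or None if nobody requests the bus.
    """
    # Step 1: TDMA. The slot owner wins if it wants the bus.
    if slot_owner in requests:
        return slot_owner
    # Step 2: the slot is unused, so fall back to priority-based arbitration.
    for module in priority:
        if module in requests:
            return module
    return None


tdma_table = ["dsp", "cpu", "io", "dsp"]   # hypothetical slot assignment
priority = ["cpu", "dsp", "io"]            # hypothetical priority order
requests_per_slot = [{"dsp", "cpu"}, {"io"}, set(), {"cpu", "io"}]

grants = [
    arbitrate(owner, reqs, priority)
    for owner, reqs in zip(tdma_table, requests_per_slot)
]
```

In slot 1 the owner (`cpu`) is idle, so the slot falls through to the priority step and goes to `io`, the only requester.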
Pipelined TDMA Bus
Arbitration
 Pipeline depth
Based on memory target latency at the desired clock
frequency
Design Example: Carrier-Class VoIP Processing Card
DSP + CPU banks + IO + DRAM
DSP: ~16 processors
voice and modem protocols
LEC
CPU: ~4 processors
Packet protocols
Control (call setup)
Hi BW SDRAM
Communication Bandwidth
Requirements: Basic I/O
IO traffic is low BW
Data IO rates
= 1000 ch x 64 kb/s x 3 x 2 (full duplex)
= 48 MB/s (worst case)
Data are buffered to SDRAM
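The worst-case I/O figure can be checked with a few lines of arithmetic. The sketch assumes that "full duplex" contributes a factor of 2 and that 1 MB = 10^6 bytes; the variable names are illustrative.

```python
channels = 1000
rate_bits_per_s = 64_000      # 64 kb/s per channel
streams = 3                   # factor of 3 from the slide
directions = 2                # assumption: full duplex doubles the traffic

io_bytes_per_s = channels * rate_bits_per_s * streams * directions / 8
io_mb_per_s = io_bytes_per_s / 1e6
```

Under these assumptions `io_mb_per_s` comes out at 48, matching the slide.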
Communication Bandwidth
Requirements: Cache Updates
CPU cache swap
-assuming 1.6MIPS/channel
-Total BW requirements:
48 + 600 + 320 = 968 (MB/s)
µNetwork Implementation
Derivative Design Example
-Full G.168 LEC uses a specialized core
-LEC has local 4MB memory
-# of channels: 1000 --> 2000
-Increased traffic
-Bus width: 64 --> 128 (bits)
Circuit Switch Network:
Philips PROPHID Architecture
Focus on high-throughput signal
processing for multimedia applications
Requirements
High computation capacity and high communication bandwidth
Performance and programmability
PROPHID
Heterogeneous multi-processor architecture consisting
of general and application specific processors
General purpose processor
Control and low-medium signal processing
Application specific processors
High performance signal processing
Philips Multi-window TV
application
PNX8500
PROPHID architecture
PROPHID: An Architecture Template
For high throughput (~10 Gbit/s) and reconfigurable connection: switch matrix, 20 proc's, 64 MHz
For programmability and control app's: control-oriented bus
~10 GOPS
Autonomous tasks based on data-driven execution
PROPHID: Autonomous
Execution of ADS Processors
- Autonomous task execution on Application Domain
Specific (ADS) processors
- Stream-based execution
- Data-availability determines the execution of tasks.
- Master(CPU)-slave synchronization can be avoided.
Kahn Process Network Model
of Multi-window Application
Communication
Infrastructure
Processor Model and
Surrounding Shell
Circuit Switch Network
Guaranteeing the throughput of streams
with hard-real-time constraints in the
PROPHID architecture.
Requirements of task execution on ADS
processors
Time-interleaved task execution
Each task requires input/output FIFO’s.
Circuit Switch Network
Network Topology
Time-Space-Time Routing
High-Performance
Communication Network in
PROPHID Architecture
time
space
time
Chip Photo and Metrics
A Generic Architecture for On-Chip
Packet-Switched Interconnections,
DATE 2000.
 A scalable system-level interconnection template is presented.
A Generic Architecture for
On-Chip Packet-Switched
Interconnections
 Bus-based architecture will not meet the bandwidth
requirements, since
it is inherently non-scalable in terms of bandwidth
Bandwidth is shared by connected comp’s.
 Multiple on-chip bus approaches like VSIA
case-specific grouping of IP’s
Not a truly scalable and reusable interconnection.
 In this paper, a generic interconnection template is
presented.
A Generic Architecture for
On-Chip Packet-Switched
Interconnections
Switching networks
Circuit switching
like PROPHID communication network
High performance
Drawbacks
• lack of reactivity against rapidly changing comm.
– E.g. data bursts in MPEG (worst case should be
assumed.), random traffic between CPU master and
slaves.
Packet switching
Packets are forwarded by routers, as on the Internet.
Because routing decisions are distributed over the routers, the
network can remain very reactive.
Packet Routing
Wormhole routing
Network Topology: Fat-tree
Network
-Ex. 16 terminals: 8 --> 8 communication
-The terminals can be processors, DSPs, memory, etc.
- Routers are free to use any of the available paths
- Packet: a sequence of 32 bit words
- Packet payload may be of any size
Scalability of Fat-Tree
Network
Scaling and Protocol Stack
Real Implementation
Network Costs and Latency
- One drawback of packet-switched
network
--> inherently arbitrary delay
Pros and Cons:
Bus versus Network
Structured On-Chip
Communication Network,
DAC2001
Why structured network?
Global routing on SoC is hard to
characterize and design.
It would be better to have electrically well
characterized wiring.
-Top 2 metal layers are used
-2D folded torus topology
-Each tile can have processor, DSP,
memory, I/O, etc.
-256bit data line
-Virtual channel support
Router Architecture
Real Implementation
0.1 µm CMOS
Router overhead
Eight virtual channels at each edge of tile
4 flits x 300 b/flit = 1200 b per virtual channel
Each tile has ~5 kB (= 4 edges x 8 channels x 1200 b) buffer
storage
Metal routing: 50 µm x 3 mm
Total router overhead: 6.6% (0.59 mm²)
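The buffer and area figures on this slide can be cross-checked numerically. The sketch assumes one set of eight virtual channels per tile edge, four edges per tile, and 1 kB = 1000 bytes. The implied tile area is derived here from the two overhead numbers and is not stated on the slide.

```python
flits_per_vc = 4
bits_per_flit = 300
vcs_per_edge = 8
edges_per_tile = 4   # assumption: one input controller per tile edge

bits_per_vc = flits_per_vc * bits_per_flit                    # 1200 b
bits_per_tile = bits_per_vc * vcs_per_edge * edges_per_tile   # 38400 b
kbytes_per_tile = bits_per_tile / 8 / 1000                    # about 4.8 kB

router_area_mm2 = 0.59
overhead = 0.066
# Tile area implied by "0.59 mm² is 6.6% of the tile".
implied_tile_area_mm2 = router_area_mm2 / overhead
```

The result is about 4.8 kB of buffering per tile, in line with the ~5 kB quoted, and an implied tile area of roughly 9 mm².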
Network Processor Design:
ST Microelectronics Octagon
 OC-768
40Gbps
114 x 10^6 packets/s, 44 B/packet
Processing requirement
1 / (114 x 10^6) = 9 ns/packet
Each packet needs 500 instructions to execute
57GIPS
• No single processor!
• Multiprocessors w/ high communication BW
 Communication network for multiprocessor
SoC of OC-768
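The OC-768 processing requirement follows from the line rate and the packet size. A short check, using only the numbers on the slide (the variable names are illustrative):

```python
line_rate_bps = 40e9          # OC-768, 40 Gbps
packet_bytes = 44
instr_per_packet = 500

packets_per_s = line_rate_bps / (packet_bytes * 8)   # about 114 x 10^6
ns_per_packet = 1e9 / packets_per_s                  # about 9 ns
gips = packets_per_s * instr_per_packet / 1e9        # about 57 GIPS
```

The per-packet budget of roughly 9 ns is what rules out a single processor and motivates a multiprocessor with a high-bandwidth communication network.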
Octagon
ST Microelectronics
Octagon
Octagon
Cross Bar
Node Model
Scaling and Comparison
with Cross Bar
Summary
 Introduction to on-chip communication
 On-chip communication architecture
Software architecture
Hardware architecture
 On-chip communication networks
Routing
Topology
 Part 2 will treat
Analysis and optimization of on-chip communication network
On-chip communication design on unreliable interconnect
Open issues and summary
Part 2
Analysis of On-Chip
Communication
Analysis
Quality of service, runtime, power
consumption, etc.
Modeling of architecture components
OS modeling
Communication network modeling
On-chip bus
Packet switch network
Analysis of On-Chip
Communication
Given communication network and
mapping
Trace-based
S. Dey
Worst-case
R. Ernst : SW + HW
Statistical analysis
Queueing theory in packet switch network
Other modeling methods
Performance Analysis of
On-Chip Communication
 Analysis with synthetic statistical testbenches
Hierarchical bus, TDMA, Ring
ICVD'00
 Trace-based analysis
Hierarchical bus
ICCAD99, SiPS
 Queueing theory
Circuit, packet switch
DAC01
Optimization of On-Chip
Communication
HW architecture
Communication resource management
Mapping, (reconfigurable) interconnection
(topology), scheduling and routing
Performance and power
Modulation/demodulation
Power
Average performance
SW architecture
Optimization of On-Chip
Communication Network
 On-chip bus design
Gajski
Daveau
Glesner
 Mapping and interconnection topology
S. Dey, ICCAD00
Potkonjak, ICCAD00
Pedram, DATE00
Others for low power, DAC2001
Optimization of On-Chip
Communication Network
 Scheduling and routing
S. Dey: ICCAD, DAC (CAT, reconfigurable)
W/ mapping and interconnection topology
Circuit switch
Comm. arch. Book
Packet switch
Comm. arch. Book
Octagon
 For better optimization: work on a virtual channel or message
basis, not on a physical module basis!
Optimization in SW On-Chip
Comm. Architecture Design
Middleware, OS, device driver
Minimum service implementation
Component-based middleware/OS design
Pebble, GO!, …
TIMA
JAVA-based implementation
JavaOS and JVM
Application-specific implementation
BREW
On-chip communication on
unreliable interconnect
Encoding/decoding
Low-power bus encoding, DAC, Benini
Communication on unreliable
communication media
CDMA style
To maintain average/statistical performance
Find the paper
R. A. Bergamaschi and W. R. Lee,
“Designing Systems-on-Chip Using Cores”, DAC 2000.
 The problem of assembling SoC’s using IP blocks
error-prone, labor-intensive, time-consuming
since the designer should understand
the functionality
interfaces
electrical characteristics of cores such as
processors, mem. controllers, bus arbiters, etc.
Moreover, cores are parameterized and need to be
configured according to their use in the SoC.
Designing Systems-on-Chip
Using Cores
 With the VSIA’s Virtual Component Interfaces, the
designer still has to do
wrapper design
architecture design
assembling the SoC using VCI’s and wrappers
 A digression: two key points in our design flow
application specific wrapper (comm. co-processor)
design
application specific architecture design flow
Designing Systems-on-Chip
Using Cores
 Designers’ tasks to configure the bus architecture
define the cores to be used
32, 64, 128 bit bus, proc. characteristics, HW/SW
understand the functionality of all pins on all cores
and determine their connections
define request priorities, e.g. interrupt priorities
define the usage of DMA
define address maps
define clock domains
insert glue logic
insert/configure test logic
 There has been no tool to automate those tasks.
Designing Systems-on-Chip
Using Cores
Automating SoC integration: 6 steps
1. Virtual design
Virtual component (VC) is a representation of a class of real
components.
E.g. PowerPC VC represents all real PowerPC cores (e.g.
401, 405, etc.).
Virtual interface is used instead of real interface.
• Smaller number of interface pins
2. Glueless interface
Automatic generation of glue logic
• First, include necessary glue logic into the core.
• Remaining minor glue logic is automatically generated.
Designing Systems-on-Chip
Using Cores
Automating SoC integration: 6 steps
3. Core and pin properties
encode the structural and functional characteristics of a
component and its pins.
Properties attached to all components and pins
Automatic pin connection algorithm is used.
Properties
• BUS_TYPE: ASB, APB, etc.
• INTERFACE_TYPE: MASTER, SLAVE
• FUNCTION_TYPE: READ, WRITE, INTERRUPT
• OPERATION_TYPE: REQUEST, ACKNOWLEDGE
• DATA_TYPE, RESOURCE_TYPE
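Property-driven pin connection can be illustrated with a small matcher. This is not the algorithm of the paper: the `Pin` class, the compatibility rule (same bus, function and operation, opposite master/slave roles), and the example cores are assumptions chosen to show how pin properties could drive automatic connection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pin:
    core: str
    name: str
    bus_type: str        # e.g. "ASB", "APB"
    interface_type: str  # "MASTER" or "SLAVE"
    function_type: str   # "READ", "WRITE", "INTERRUPT"
    operation_type: str  # "REQUEST" or "ACKNOWLEDGE"


def compatible(a, b):
    """Assumed rule: same bus, function and operation; opposite roles."""
    return (
        a.core != b.core
        and a.bus_type == b.bus_type
        and a.function_type == b.function_type
        and a.operation_type == b.operation_type
        and {a.interface_type, b.interface_type} == {"MASTER", "SLAVE"}
    )


def connect(pins):
    """Return (master pin, slave pin) name pairs whose properties match."""
    masters = [p for p in pins if p.interface_type == "MASTER"]
    slaves = [p for p in pins if p.interface_type == "SLAVE"]
    return [
        (f"{m.core}.{m.name}", f"{s.core}.{s.name}")
        for m in masters
        for s in slaves
        if compatible(m, s)
    ]


pins = [
    Pin("cpu", "rd_req", "ASB", "MASTER", "READ", "REQUEST"),
    Pin("mem", "rd_req", "ASB", "SLAVE", "READ", "REQUEST"),
    Pin("mem", "wr_req", "ASB", "SLAVE", "WRITE", "REQUEST"),
    Pin("uart", "rd_req", "APB", "SLAVE", "READ", "REQUEST"),
]
nets = connect(pins)
```

Only the ASB read-request pins of `cpu` and `mem` are connected; the write pin and the APB pin find no matching master in this example.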
Designing Systems-on-Chip
Using Cores
Automating SoC integration: 6 steps
4. Interconnection engine
5. Virtual to real synthesis
Designing Systems-on-Chip
Using Cores
Automating SoC integration: 6 steps
6. Configuration engine
clocking, address map, interrupt map, DMA channel
assignment, etc.
Comments
To free the designer from pin interconnection and
glue logic design
Limitation
Automation applies to HW Module interface only at pin level
(with a fixed target architecture)
SW module interfacing (i.e. targeting and processor
interfacing) is not considered.
Open Issues
Architectural trade-off
HW/SW trade-off
in middleware and OS service implementation
Communication network design
Prioritized packet network design
Interconnection topology design with
physical DSM effects
Open Issues
Reconfigurable on-chip communication
In connection with component-based SoC
design
On-chip communication design w/
unreliable media
Unreliable physical wiring and environment
Summary