CourseCode L1T1H1-10 CellArchitecture

Systems and Technology Group
Cell Architecture
Course code: L1T1H1-10
Cell Ecosystem Solutions Enablement
Course Code: L1T1H1-10 Cell Architecture
04/06/06
© 2006 IBM Corporation
Class Objectives – Things you will learn
 Cell history and Cell design motivation
 How Cell overcomes three important limiters of contemporary microprocessor performance: power use, memory use, and processor frequency
 Cell processor organization and components
– Power processor element, block diagram, PXU pipeline
– Synergistic processor element, block diagram, SXU pipeline
– Memory flow controller and MFC commands
– Element interconnect bus, command and data topology
– I/O and memory interfaces
– Resource allocation management
Class Agenda

 Cell history
 Cell highlights
 Introducing Cell
 Performance over time
 Cell concept
 Microprocessor architecture trends
 Architecture motivators
 Cell synergy
 Cell features
 Cell processor components
– Power processor element
– Synergistic processor element
– Memory flow controller
– Element interconnect bus
– I/O and memory interfaces
– Resource allocation management
References

Jim Kahle, Cell Broadband Engine and Cell Broadband Engine Architecture
Trademarks - Cell Broadband Engine ™ is a trademark of Sony Computer Entertainment, Inc.
Cell
Cell History
 IBM, SCEI/Sony, Toshiba Alliance formed in 2000
 Design Center opened in March 2001
 Based in Austin, Texas
 February 7, 2005: First technical disclosures
 May 16, 2005: First public demonstrations at E3
 August 25, 2005: Release of technical documentation
– Cell Broadband Engine Architecture documentation can be found at:
 http://www.ibm.com/developerworks/power/cell
– Additional publications on Cell can be downloaded from:
 http://www.ibm.com/chips/techlib/techlib.nsf/products/Cell
 http://www.power.org/resources/devcorner/cellcorner
– A paper on Cell in the IBM Journal of Research and Development can be found at:
 http://www.research.ibm.com/journal/rd/494/kahle.html
Cell Highlights
 Supercomputer on a chip
 Multi-core microprocessor (9 cores)
 3.2 GHz clock frequency
 10x performance for many applications
 Digital home to distributed computing
Introducing Cell

Cell is an accelerator extension to Power
– Built on a Power ecosystem
– Used best-known system practices for processor design

Sets a new performance standard
– Exploits parallelism while achieving high frequency
– Supercomputer attributes with extreme floating point capabilities
– Sustains high memory bandwidth with smart DMA controllers

Designed for natural human interaction
– Photo-realistic effects
– Predictable real-time response
– Virtualized resources for concurrent activities

Designed for flexibility
– Wide variety of application domains
– Highly abstracted to highly exploitable programming models
– Reconfigurable I/O interfaces
– Virtual trusted computing environment for security
Cell Concept
 Compatibility with 64b Power Architecture™
– Builds on and leverages IBM investment and community
 Increased efficiency and performance
– Attacks on the “Power Wall”
• Non-homogeneous coherent multiprocessor
• High design frequency at a low operating voltage with advanced power management
– Attacks on the “Memory Wall”
• Streaming DMA architecture
• 3-level memory model: Main Storage, Local Storage, Register Files
– Attacks on the “Frequency Wall”
• Highly optimized implementation
• Large shared register files and software-controlled branching to allow deeper pipelines
 Interface between user and networked world
– Image rich information, virtual reality
– Flexibility and security
 Multi-OS support, including RTOS / non-RTOS
– Combine real-time and non-real time worlds
Frequency Increase vs Power Consumption
[Figure: relative power and relative frequency (vertical axis, 0 to 3.5) plotted against voltage (horizontal axis, 0.9 to 1.3)]
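The trade-off in this chart is often approximated with the textbook first-order CMOS model, in which dynamic power scales as C·V²·f and achievable frequency scales roughly linearly with voltage. The sketch below uses that model purely as an illustration: the model is an assumption, not the data behind the chart, and the function names are hypothetical.

```python
def relative_frequency(voltage, nominal_voltage=1.0):
    """Assumed first-order model: achievable frequency scales linearly with voltage."""
    return voltage / nominal_voltage


def relative_dynamic_power(voltage, nominal_voltage=1.0):
    """Assumed first-order model: P ~ C * V^2 * f, with f tracking V (so P ~ V^3)."""
    v = voltage / nominal_voltage
    return v * v * relative_frequency(voltage, nominal_voltage)


# Sweep the same voltage range as the chart's horizontal axis.
for v in (0.9, 1.0, 1.1, 1.2, 1.3):
    print(f"V={v:.1f}  f_rel={relative_frequency(v):.2f}  "
          f"P_rel={relative_dynamic_power(v):.2f}")
```

Under this model power grows much faster than frequency as voltage rises, which is the motivation for running at a high design frequency but a low operating voltage.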
Architecture Motivators
Market Requirements
• Natural interaction with the system
• Consumer acceptable interaction
• Improve Experience
– Ease of use
– High degree of interaction
– Responsive
– Realism
– Interconnected through network to other devices
Holistic Design Approach
• Architecture
• Hardware implementation
• System structure
• Programming Model
Technical Requirements
• Dual environment: Real time and conventional
• High FLOPS Computational density
• High parallelism
• Bandwidth & latency controls
• Realtime response
• Resource reservation
• High bandwidth
Cell Synergy
 Cell is not a collection of different processors, but a synergistic
whole
– Operation paradigms, data formats and semantics consistent
– Share address translation and memory protection model
 PPE for operating systems and program control
 SPE optimized for efficient data processing
– SPEs share Cell system functions provided by Power Architecture
– MFC implements interface to memory
• Copy in/copy out to local storage
 PowerPC provides system functions
– Virtualization
– Address translation and protection
– External exception handling
 EIB integrates system as data transport hub
Cell Features
 Heterogeneous multi-core system architecture
– Power Processor Element for control tasks
– Synergistic Processor Elements for data-intensive processing
 Synergistic Processor Element (SPE) consists of
– Synergistic Processor Unit (SPU)
– Synergistic Memory Flow Control (MFC)
• Data movement and synchronization
• Interface to high-performance Element Interconnect Bus
[Figure: Cell block diagram. SPEs (each an SPU with SXU and LS, plus an MFC) and the PPE (PPU with PXU, L1, and L2; 64-bit Power Architecture with VMX) attach to the EIB (up to 96B/cycle); the MIC connects to Dual XDR™ and the BIC to FlexIO™. Link labels shown: 16B/cycle, 16B/cycle (2x), 32B/cycle.]
Cell Broadband Engine – 235 mm²
Cell Processor Components
Power Processor Element (PPE):
• General purpose, 64-bit RISC
processor (PowerPC AS 2.0.2)
• 2-Way hardware multithreaded
• L1 : 32KB I ; 32KB D
• L2 : 512KB
• Coherent load / store
• VMX-32
• Realtime Controls
– Locking L2 Cache & TLB
– Software / hardware managed TLB
– Bandwidth / Resource Reservation
– Mediated Interrupts
Element Interconnect Bus (EIB):
• Four 16 byte data rings supporting multiple
simultaneous transfers per ring
• 96Bytes/cycle peak bandwidth
• Over 100 outstanding requests
In the Beginning
– the solitary Power Processor
Custom Designed
– for high frequency, space, and power efficiency
[Figure: Power Core (PPE) with L2 Cache and NCU attached to the Element Interconnect Bus (96 Byte/Cycle)]
Cell Processor Components
Synergistic Processor Element (SPE):
• Provides the computational performance
• Simple RISC User Mode Architecture
– Dual issue VMX-like
– Graphics SP-Float
– IEEE DP-Float
• Dedicated resources: unified 128x128-bit RF, 256KB Local Store
• Dedicated DMA engine: Up to 16 outstanding requests
Memory Management & Mapping
• SPE Local Store aliased into PPE system memory
• MFC/MMU controls / protects SPE DMA accesses
– Compatible with PowerPC Virtual Memory Architecture
– SW controllable using PPE MMIO
• DMA 1,2,4,8,16,128 -> 16Kbyte transfers for I/O access
• Two queues for DMA commands: Proxy & SPU
[Figure: Cell block diagram: SPEs (SPU, Local Store, AUC, MFC) alongside the Power Core (PPE), L2 Cache, and NCU on the Element Interconnect Bus (96 Byte/Cycle)]
Cell Processor Components
Memory Interface Controller (MIC):
• Dual XDR™ controller (25.6GB/s @ 3.2Gbps)
• ECC support
• Suspend to DRAM support
Broadband Interface Controller (BIC):
 Provides a wide connection to external devices
 Two configurable interfaces (60GB/s @ 5Gbps)
– Configurable number of bytes
– Coherent (BIF) and / or I/O (IOIFx) protocols
 Supports two virtual channels per interface
 Supports multiple system configurations
[Figure: Cell block diagram with the MIC connected to XDR DRAM (25 GB/sec), BIF or IOIF0 (20 GB/sec), and IOIF1 (5 GB/sec) connected to Southbridge I/O; Element Interconnect Bus at 96 Byte/Cycle]
Cell Processor Components
Internal Interrupt Controller (IIC)
 Handles SPE Interrupts
 Handles External Interrupts
– From Coherent Interconnect
– From IOIF0 or IOIF1
 Interrupt Priority Level Control
 Interrupt Generation Ports for IPI
 Duplicated for each PPE hardware thread
I/O Bus Master Translation (IOT)
 Translates Bus Addresses to System Real Addresses
 Two Level Translation
– I/O Segments (256 MB)
– I/O Pages (4K, 64K, 1M, 16M byte)
 I/O Device Identifier per page for LPAR
 IOST and IOPT Cache – hardware / software managed
[Figure: Cell block diagram with the IIC and IOT blocks; XDR DRAM (25 GB/sec), BIF or IOIF0 (20 GB/sec), IOIF1 (5 GB/sec) to Southbridge I/O; Element Interconnect Bus at 96 Byte/Cycle]
Cell Processor Components
Token Manager (TKM):
 Bandwidth / Resource Reservation for shared resources
 Optionally enabled for RT tasks or LPAR
 Multiple Resource Allocation Groups (RAGs)
 Generates access tokens at configurable rate for each allocation group
– 1 per each memory bank (16 total)
– 2 for each IOIF (4 total)
 Requestors assigned RAG ID by OS / hypervisor
– Each SPE
– PPE L2 / NCU
– IOIF 0 Bus Master
– IOIF 1 Bus Master
 Priority order for using another RAG's unused tokens
 Resource over-committed warning interrupt
[Figure: Cell block diagram with the TKM, IIC, and IOT blocks; XDR DRAM (25 GB/sec), BIF or IOIF0 (20 GB/sec), IOIF1 (5 GB/sec) to Southbridge I/O; Element Interconnect Bus at 96 Byte/Cycle]
Power Processor Element

PPE handles operating system and control tasks
– 64-bit Power Architecture™ with VMX
– In-order, 2-way hardware simultaneous multi-threading (SMT)
– Coherent Load/Store with 32KB I & D L1 and 512KB L2
PPE BLOCK DIAGRAM
[Figure: PPE block diagram. Instruction side: Pre-Decode, Fetch Control, L2 Interface, Branch Scan, L1 Instruction Cache, per-thread buffers (Thread A, Thread B), SMT Dispatch (Queue), Microcode, Decode, Dependency, Issue. Execution side: Branch Execution Unit, Load/Store Unit with L1 Data Cache, Fixed-Point Unit, Completion/Flush. Vector and floating-point side: VMX/FPU Issue (Queue), VMX Load/Store/Permute, VMX Arith./Logic Unit, FPU Arith/Logic Unit, FPU Load/Store, VMX Completion, FPU Completion. Annotation: threads alternate fetch and dispatch cycles.]
PXU PIPELINE FRONT END
[Figure: front-end pipeline stages. Instruction Cache and Buffer: IC1-IC4, IB1-IB2. Branch Prediction: BP1-BP4. Microcode: MC1-MC4 ... MC9-MC11. Instruction Decode and Issue: ID1-ID3, IS1-IS3.]
PPE PIPELINE BACK END
[Figure: back-end pipeline stages for Branch Instruction, Fixed Point Unit Instruction, and Load/Store Instruction, built from delay stages (DLY), register file access (RF1-RF2), execution stages (EX1 up to EX8), and write back (WB); the labels IBZ and IC0 also appear.]
Legend:
IC Instruction Cache
IB Instruction Buffer
BP Branch Prediction
MC Microcode
ID Instruction Decode
IS Instruction Issue
DLY Delay Stage
RF Register File Access
EX Execution
WB Write Back
Synergistic Processor Element

SPE provides computational performance
– Dual issue, up to 16-way 128-bit SIMD
– Dedicated resources: 128 128-bit RF, 256KB Local Store
– Each can be dynamically configured to protect resources
– Dedicated DMA engine: Up to 16 outstanding requests
SPE Highlights
 RISC like organization
– 32 bit fixed instructions
– Clean design – unified Register file
 User-mode architecture
– No translation/protection within SPU
– DMA is full Power Arch protect/x-late
 VMX-like SIMD dataflow
– Broad set of operations (8 / 16 / 32 Byte)
– Graphics SP-Float
– IEEE DP-Float
 Unified register file
– 128 entry x 128 bit
 256KB Local Store
– Combined I & D
– 16B/cycle L/S bandwidth
– 128B/cycle DMA bandwidth
[Figure: SPE floorplan, 14.5 mm² (90nm SOI). Labeled blocks: LS, DP, SFP, FWD, GPR, FXU EVN, FXU ODD, CONTROL, CHANNEL, DMA, SBI, SMM, BEB, ATO, RTB.]
What is a Synergistic Processor?
(and why is it efficient?)
 Local Store “is” large 2nd level register file / private instruction store instead of cache
– Asynchronous transfer (DMA) to shared memory
– Frontal attack on the Memory Wall
 Media Unit turned into a Processor
– Unified (large) Register File
– 128 entry x 128 bit
 Media & Compute optimized
– One context
– SIMD architecture
[Figure: SPE floorplan. Labeled blocks: SPU, LS, SFP, DP, FWD, GPR, FXU EVN, FXU ODD, CONTROL, CHANNEL, SBI, SMM, BEB, DMA, ATO, SMF, RTB.]
SPU Detail
Synergistic Processor Element (SPE)
 User-mode architecture
– No translation/protection within SPE
– DMA is full PowerPC protect/xlate
 Direct programmer control
– DMA/DMA-list
– Branch hint
 VMX-like SIMD dataflow
– Graphics SP-Float
– No saturate arith, some byte
– IEEE DP-Float (BlueGene-like)
 Unified register file
– 128 entry x 128 bit
 256KB Local Store
– Combined I & D
– 16B/cycle L/S bandwidth
– 128B/cycle DMA bandwidth
SPU Units:
– Simple (FXU even)
• Add/Compare
• Rotate
• Logical, Count Leading Zero
– Permute (FXU odd)
• Permute
• Table-lookup
– FPU (Single / Double Precision)
– Control (SCN)
• Dual Issue, Load/Store, ECC Handling
– Channel (SSC) – Interface to MFC
– Register File (GPR/FWD)
[Figure: SPE floorplan with the Memory Flow Control (MFC) region marked. Labeled blocks: LS, DP, SFP, FWD, GPR, FXU EVN, FXU ODD, CONTROL, CHANNEL, DMA, SMM, SBI, BEB, ATO, RTB.]
SPU Latencies
– Simple fixed point - 2 cycles*
– Complex fixed point - 4 cycles*
– Load - 6 cycles*
• Local store size = 256 KB
– Single-precision (ER) float - 6 cycles*
– Integer multiply - 7 cycles*
– Branch miss - 20 cycles
• No penalty if correctly hinted
– DP (IEEE) float - 13 cycles*
• Partially pipelined
– Enqueue DMA Command - 20 cycles*
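The latencies listed for the SPU can be used for a rough estimate of how long a chain of dependent instructions takes. The sketch below is only a back-of-the-envelope model under a stated assumption: every instruction waits for the full latency of its predecessor, with no overlap from dual issue or pipelining. The dictionary keys and function name are hypothetical.

```python
# Latencies in cycles, copied from the SPU Latencies list.
SPU_LATENCY = {
    "simple_fixed": 2,
    "complex_fixed": 4,
    "load": 6,
    "sp_float": 6,
    "int_multiply": 7,
    "dp_float": 13,
}
BRANCH_MISS_PENALTY = 20  # cycles; no penalty if the branch is correctly hinted


def dependent_chain_cycles(ops, mispredicted_branches=0):
    """Assumed model: each op starts only when the previous op's result is ready."""
    return (sum(SPU_LATENCY[op] for op in ops)
            + mispredicted_branches * BRANCH_MISS_PENALTY)


chain = ["load", "sp_float", "sp_float", "simple_fixed"]
print(dependent_chain_cycles(chain))                           # 6 + 6 + 6 + 2 = 20
print(dependent_chain_cycles(chain, mispredicted_branches=1))  # 20 + 20 = 40
```

In this model a single unhinted branch miss costs as much as the whole four-instruction chain, which illustrates why the slide lists branch hints under direct programmer control.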
SPE BLOCK DIAGRAM
[Figure: SPE block diagram. Floating-Point Unit, Permute Unit, Fixed-Point Unit, Load-Store Unit, Branch Unit, and Channel Unit connected through Result Forwarding and Staging to the Register File; Instruction Issue Unit / Instruction Line Buffer; Local Store (256kB) Single Port SRAM with 128B Read and 128B Write paths; DMA Unit; On-Chip Coherent Bus. Path widths shown: 8 Byte/Cycle, 16 Byte/Cycle, 64 Byte/Cycle, 128 Byte/Cycle.]
SXU PIPELINE FRONT END
[Figure: front-end pipeline stages IF1-IF5, IB1-IB2, ID1-ID3, IS1-IS2]
SPE PIPELINE BACK END
[Figure: back-end pipeline stages following register file access (RF1-RF2). Branch Instruction; Permute Instruction: EX1-EX4, WB; Load/Store Instruction: EX1-EX6, WB; Fixed Point Instruction: EX1-EX2, WB; Floating Point Instruction: EX1-EX6, WB.]
Legend:
IF Instruction Fetch
IB Instruction Buffer
ID Instruction Decode
IS Instruction Issue
RF Register File Access
EX Execution
WB Write Back
MFC Detail
[Figure: MFC block diagram. Blocks: SPU, Local Store, SPC, DMA Engine, DMA Queue, Atomic Facility, MMU, RMT, Bus I/F Control, MMIO. Legend: Data Bus, Snoop Bus, Control Bus, Xlate Ld/St, MMIO.]
Memory Flow Control System
• DMA Unit
– LS <-> LS, LS <-> Sys Memory, LS <-> I/O Transfers
– 8 PPE-side Command Queue entries
– 16 SPU-side Command Queue entries
• MMU similar to PowerPC MMU
– 8 SLBs, 256 TLBs
– 4K, 64K, 1M, 16M page sizes
– Software/HW page table walk
– PT/SLB misses interrupt PPE
• Atomic Cache Facility
– 4 cache lines for atomic updates
– 2 cache lines for cast out/MMU reload
• Up to 16 outstanding DMA requests in BIU
• Resource / Bandwidth Management Tables
– Token Based Bus Access Management
– TLB Locking
Isolation Mode Support (Security Feature)
 Hardware enforced “isolation”
– SPU and Local Store not visible (bus or jtag)
– Small LS “untrusted area” for communication area
 Secure Boot
– Chip Specific Key
– Decrypt/Authenticate Boot code
 “Secure Vault” – Runtime Isolation Support
– Isolate Load Feature
– Isolate Exit Feature
Per SPE Resources (PPE Side)
(Each region below starts on a 4K physical page boundary.)
Problem State
– 8 Entry MFC Command Queue Interface
– DMA Command and Queue Status
– DMA Tag Status Query Mask
– DMA Tag Status
– 32 bit Mailbox Status and Data from SPU
– 32 bit Mailbox Status and Data to SPU (4 deep FIFO)
– Signal Notification 1
– Signal Notification 2
– SPU Run Control
– SPU Next Program Counter
– SPU Execution Status
– Optionally Mapped 256K Local Store
Privileged 2 State (OS or Hypervisor)
– SPU Privileged Control
– SPU Channel Counter Initialize
– SPU Channel Data Initialize
– SPU Signal Notification Control
– SPU Decrementer Status & Control
– MFC DMA Control
– MFC Context Save / Restore Registers
– SLB Management Registers
– Optionally Mapped 256K Local Store
Privileged 1 State (OS)
– SPU Master Run Control
– SPU ID
– SPU ECC Control
– SPU ECC Status
– SPU ECC Address
– SPU 32 bit PU Interrupt Mailbox
– MFC Interrupt Mask
– MFC Interrupt Status
– MFC DMA Privileged Control
– MFC Command Error Register
– MFC Command Translation Fault Register
– MFC SDR (PT Anchor)
– MFC ACCR (Address Compare)
– MFC DSSR (DSI Status)
– MFC DAR (DSI Address)
– MFC LPID (logical partition ID)
– MFC TLB Management Registers
Per SPE Resources (SPU Side)
SPU Direct Access Resources
128 - 128 bit GPRs
External Event Status (Channel 0)
Decrementer Event
Tag Status Update Event
DMA Queue Vacancy Event
SPU Incoming Mailbox Event
Signal 1 Notification Event
Signal 2 Notification Event
Reservation Lost Event
External Event Mask (Channel 1)
External Event Acknowledgement (Channel 2)
Signal Notification 1 (Channel 3)
Signal Notification 2 (Channel 4)
Set Decrementer Count (Channel 7)
Read Decrementer Count (Channel 8)
16 Entry MFC Command Queue Interface (Channels 16-21)
DMA Tag Group Query Mask (Channel 22)
Request Tag Status Update (Channel 23)
Immediate
Conditional - ALL
Conditional - ANY
Read DMA Tag Group Status (Channel 24)
DMA List Stall and Notify Tag Status (Channel 25)
DMA List Stall and Notify Tag Acknowledgement (Channel 26)
Lock Line Command Status (Channel 27)
Outgoing Mailbox to PU (Channel 28)
Incoming Mailbox from PU (Channel 29)
Outgoing Interrupt Mailbox to PU (Channel 30)
SPU Indirect Access Resources
(via EA Addressed DMA)
System Memory
Memory Mapped I/O
This SPU Local Store
Other SPU Local Store
Other SPU Signal Registers
Atomic Update (Cacheable Memory)
Memory Flow Controller Commands
DMA Commands
Put - Transfer from Local Store to EA space
Puts - Transfer and Start SPU execution
Putr - Put Result - (Arch. Scarf into L2)
Putl - Put using DMA List in Local Store
Putrl - Put Result using DMA List in LS (Arch)
Get - Transfer from EA Space to Local Store
Gets - Transfer and Start SPU execution
Getl - Get using DMA List in Local Store
Sndsig - Send Signal to SPU
Command Modifiers: <f,b>
f: Embedded Tag Specific Fence
Command will not start until all previous commands in same tag group have completed
b: Embedded Tag Specific Barrier
Command and all subsequent commands in same tag group will not start until previous commands in same tag group have completed
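The difference between the fence and barrier modifiers can be illustrated with a small ordering model. This is a conceptual sketch of the rules as worded on the slide, not the MFC hardware or any SDK API; the `Command` class and `may_start` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Command:
    name: str
    tag: int
    modifier: str = ""   # "" (none), "f" (fence) or "b" (barrier)
    done: bool = False


def may_start(queue, index):
    """Return True if queue[index] may start under the slide's fence/barrier rules."""
    cmd = queue[index]
    earlier = [(j, c) for j, c in enumerate(queue[:index]) if c.tag == cmd.tag]
    # Fence or barrier on this command: all earlier same-tag commands must be done.
    if cmd.modifier in ("f", "b") and any(not c.done for _, c in earlier):
        return False
    # An earlier barrier also holds back every later same-tag command until
    # the commands issued before that barrier have completed.
    for j, c in earlier:
        if c.modifier == "b" and any(
                not p.done for p in queue[:j] if p.tag == cmd.tag):
            return False
    return True


queue = [
    Command("get A", tag=1),
    Command("put B", tag=1, modifier="f"),
    Command("get C", tag=1),
    Command("get D", tag=2),
]
print([may_start(queue, i) for i in range(len(queue))])
```

With a fence on "put B", only that command waits for "get A"; "get C" is free to start. Changing the modifier to "b" would also hold back "get C", while "get D" in a different tag group is unaffected in both cases.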
SL1 Cache Management Commands
sdcrt - Data cache region touch (DMA Get hint)
sdcrtst - Data cache region touch for store (DMA Put hint)
sdcrz - Data cache region zero
sdcrs - Data cache region store
sdcrf - Data cache region flush
Command Parameters
LSA - Local Store Address (32 bit)
EA - Effective Address (32 or 64 bit)
TS - Transfer Size (16 bytes to 16K bytes)
LS - DMA List Size (8 bytes to 16 K bytes)
TG - Tag Group(5 bit)
CL - Cache Management / Bandwidth Class
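As an illustration of the command parameters, the sketch below checks a set of values against the ranges given in this deck. The accepted sizes (1, 2, 4, 8 bytes, or a multiple of 16 bytes up to 16K) are an assumption that combines this list with the DMA sizes quoted on the SPE components slide; the function name and the error strings are hypothetical, and the check does not cover alignment rules.

```python
LOCAL_STORE_BYTES = 256 * 1024     # 256KB Local Store
MAX_TRANSFER_BYTES = 16 * 1024     # TS upper bound: 16K bytes
TAG_GROUPS = 1 << 5                # TG is a 5-bit field


def check_dma_params(lsa, ea, size, tag):
    """Return a list of problems; an empty list means the values look plausible."""
    problems = []
    small_sizes = (1, 2, 4, 8)     # assumed small-transfer sizes
    if size not in small_sizes and (
            size <= 0 or size % 16 != 0 or size > MAX_TRANSFER_BYTES):
        problems.append(f"unsupported transfer size: {size}")
    if not 0 <= tag < TAG_GROUPS:
        problems.append(f"tag group out of range: {tag}")
    if lsa < 0 or lsa + max(size, 0) > LOCAL_STORE_BYTES:
        problems.append("transfer does not fit in the local store")
    if not 0 <= ea < 1 << 64:
        problems.append("effective address is not a 64-bit value")
    return problems


print(check_dma_params(lsa=0x1000, ea=0x2000_0000, size=16 * 1024, tag=5))
print(check_dma_params(lsa=0x1000, ea=0x2000_0000, size=24, tag=40))
```

Returning a list of problems rather than raising on the first one makes it easier to report every bad field of a command at once.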
Synchronization Commands
Lockline (Atomic Update) Commands:
getllar - DMA 128 bytes from EA to LS and set Reservation
putllc - Conditionally DMA 128 bytes from LS to EA
putlluc - Unconditionally DMA 128 bytes from LS to EA
barrier - all previous commands complete before subsequent
commands are started
mfcsync - Results of all previous commands in Tag group
are remotely visible
mfceieio - Results of all preceding Puts commands in same
group visible with respect to succeeding Get commands
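The lockline commands follow a load-with-reservation / conditional-store pattern. The sketch below is a conceptual software model of that pattern on one 128-byte line, assuming a reservation is lost whenever any store to the line succeeds; it is not the MFC interface, and the class and method bodies are hypothetical (only the command names come from the slide).

```python
class MemoryLine:
    """Toy model of one 128-byte line in effective-address space."""
    SIZE = 128

    def __init__(self):
        self.data = bytearray(self.SIZE)
        self.reservations = set()

    def getllar(self, spe_id):
        # Copy the line to the caller's "local store" and set a reservation.
        self.reservations.add(spe_id)
        return bytearray(self.data)

    def putllc(self, spe_id, new_data):
        # Conditional store: succeeds only if the reservation is still held.
        if spe_id not in self.reservations:
            return False
        self.data[:] = new_data
        self.reservations.clear()   # assumed: a store kills all reservations
        return True

    def putlluc(self, new_data):
        # Unconditional store.
        self.data[:] = new_data
        self.reservations.clear()


def atomic_increment(line, spe_id):
    """Retry loop: reload and retry whenever the reservation was lost."""
    while True:
        local = line.getllar(spe_id)
        value = int.from_bytes(local[:4], "big") + 1
        local[:4] = value.to_bytes(4, "big")
        if line.putllc(spe_id, local):
            return value


line = MemoryLine()
print(atomic_increment(line, spe_id=0))
print(atomic_increment(line, spe_id=1))
```

The retry loop is the usual way such primitives are used: a failed conditional store means another requestor updated the line first, so the update is recomputed from fresh data.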
SPE Structure
 Scalar processing supported on data-parallel
substrate
– All instructions are data parallel and operate on vectors
of elements
– Scalar operation defined by instruction use, not opcode
• Vector instruction form used to perform operation
 Preferred slot paradigm
– Scalar arguments to instructions found in “preferred slot”
– Computation can be performed in any slot
Register Scalar Data Layout
 Preferred slot in bytes 0-3
– By convention for procedure interfaces
– Used by instructions expecting scalar data
• Addresses, branch conditions, generate controls for insert
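The preferred-slot convention can be pictured by treating a 128-bit register as 16 bytes with a 32-bit scalar held in bytes 0-3. The sketch below is only a byte-layout illustration: big-endian byte order and zero-filling of the remaining bytes are assumptions made for the example, and the helper names are hypothetical.

```python
import struct

REGISTER_BYTES = 16  # one 128-bit register


def scalar_to_register(value):
    """Place a 32-bit unsigned scalar in bytes 0-3; other bytes are zeroed here."""
    return struct.pack(">I", value) + bytes(REGISTER_BYTES - 4)


def register_to_scalar(register):
    """Read the scalar back from the preferred slot (bytes 0-3)."""
    return struct.unpack(">I", register[:4])[0]


reg = scalar_to_register(0x12345678)
print(reg.hex())
print(hex(register_to_scalar(reg)))
```

An instruction that expects a scalar, such as an address or a branch condition, would look only at the first four bytes of such a register.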
Element Interconnect Bus

EIB data ring for internal communication
– Four 16 byte data rings, supporting multiple transfers
– 96B/cycle peak bandwidth
– Over 100 outstanding requests
Element Interconnect Bus – Command Topology
 “Address Concentrator” tree structure minimizes wiring resources
 Single serial command reflection point (AC0)
 Address collision detection and prevention
 Fully pipelined
 Content-aware round robin arbitration
 Credit-based flow control
[Figure: command (CMD) paths from the PPE, SPE0-SPE7, MIC, IOIF1, and BIF/IOIF0 pass through a tree of address concentrators (AC1, AC2, AC3) into AC0; an off-chip AC0 is also shown.]
Element Interconnect Bus - Data Topology
 Four 16B data rings connecting 12 bus elements
– Two clockwise / Two counter-clockwise
 Physically overlaps all processor elements
 Central arbiter supports up to three concurrent transfers per data ring
– Two stage, dual round robin arbiter
 Each element port simultaneously supports 16B in and 16B out data path
– Ring topology is transparent to element data interface
[Figure: data rings around a central data arbiter (Data Arb). Top row: PPE, SPE1, SPE3, SPE5, SPE7, IOIF1; bottom row: MIC, SPE0, SPE2, SPE4, SPE6, BIF/IOIF0; each element has a 16B in and a 16B out connection.]
Internal Bandwidth Capability
 Each EIB Bus data port supports 25.6GBytes/sec* in each
direction
 The EIB Command Bus streams commands fast enough to
support 102.4 GB/sec for coherent commands, and 204.8
GB/sec for non-coherent commands.
 The EIB data rings can sustain 204.8GB/sec for certain
workloads, with transient rates as high as 307.2GB/sec
between bus units
Despite all that available bandwidth…
* The above numbers assume a 3.2GHz core frequency – internal bandwidth scales with core frequency
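The bandwidth figures on this slide can be reproduced with simple arithmetic. The sketch below makes several assumptions that are not stated on the slide: the EIB is clocked at half the core frequency, the command bus issues one 128-byte command per bus cycle for non-coherent traffic and half that rate for coherent traffic, and the 96B/cycle peak is counted per core cycle. Function names are hypothetical.

```python
def eib_bandwidths(core_ghz=3.2):
    """Return the slide's bandwidth figures in GB/s under the stated assumptions."""
    bus_ghz = core_ghz / 2                    # assumed: EIB runs at half core clock
    port = 16 * bus_ghz                       # one 16B port, one direction
    non_coherent = 128 * bus_ghz              # assumed: one 128B command per bus cycle
    coherent = non_coherent / 2               # assumed: half the non-coherent rate
    transient_peak = 96 * core_ghz            # 96B per core cycle
    return {
        "port_per_direction": port,
        "coherent_commands": coherent,
        "non_coherent_commands": non_coherent,
        "transient_peak": transient_peak,
    }


for name, gbs in eib_bandwidths().items():
    print(f"{name}: {gbs:.1f} GB/s")
```

Because every term is proportional to the clock, the model also reflects the footnote that internal bandwidth scales with core frequency.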
Example of eight concurrent transactions
[Figure: twelve ramps (numbered 0 to 11), each with its own controller, arranged around the central Data Arbiter and serving the MIC, SPE0-SPE7, PPE, BIF/IOIF0, and IOIF1; eight concurrent transfers are shown spread across Ring0, Ring1, Ring2, and Ring3.]
Resource Allocation Management
 Optional facility used to minimize over-allocation effects of critical resources
– Independent but complementary function to the EIB
– Critical (managed) resource’s time is distributed among groups of requestors
 Managed resources include:
– Rambus XDR™ DRAM memory banks (0 to 15)
– BIF/IOIF0 Inbound and BIF/IOIF0 Outbound
– IOIF1 Inbound and IOIF1 Outbound
 Requestors Allocated to Four Resource Allocation Groups (RAG)
– 17 requestors – PPE, SPEs, I/O Inbound (4 VCs), I/O Outbound (4 VCs)
 Central Token Manager controller
– Requestors ask permission to issue EIB commands to managed resources
– Tokens granted across RAGs allow requestor access to issue command to the EIB
– Round robin allocation within RAG
– Dynamic software configuration of the Token Manager to adjust token allocation rates for varying workloads
– Multi-level hardware feedback from managed resource congestion to throttle token allocation
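The token scheme described here can be sketched as a per-group token generator with round-robin grants inside each group. This is a simplified model for one managed resource, under stated assumptions: tokens accumulate at a configurable per-RAG rate up to a small cap, and one token admits one command. It does not model borrowing of unused tokens between RAGs or congestion feedback, and the class and parameter names are hypothetical.

```python
from collections import deque


class TokenManager:
    def __init__(self, rag_rates, rag_members, max_tokens=1.0):
        self.rates = dict(rag_rates)       # tokens generated per tick, per RAG
        self.max_tokens = max_tokens       # assumed cap on banked tokens
        self.tokens = {rag: 0.0 for rag in self.rates}
        self.members = {rag: deque(m) for rag, m in rag_members.items()}

    def tick(self, pending):
        """Advance one interval; return the requestors granted a token."""
        granted = []
        for rag, queue in self.members.items():
            self.tokens[rag] = min(self.tokens[rag] + self.rates[rag],
                                   self.max_tokens)
            for _ in range(len(queue)):
                if self.tokens[rag] < 1:
                    break
                requestor = queue[0]
                queue.rotate(-1)           # round robin within the RAG
                if requestor in pending:
                    self.tokens[rag] -= 1
                    granted.append(requestor)
        return granted


tkm = TokenManager(rag_rates={0: 1.0, 1: 0.5},
                   rag_members={0: ["SPE0", "SPE1"], 1: ["PPE"]})
waiting = {"SPE0", "SPE1", "PPE"}
for t in range(4):
    print(t, tkm.tick(waiting))
```

In this run RAG 0 is granted one command per tick, alternating between its two SPEs, while RAG 1 is granted one command every second tick, which mirrors how different allocation rates divide a managed resource's time among groups.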
I/O and Memory Interfaces

I/O Provides wide bandwidth
– Dual XDR™ controller (25.6GB/s @ 3.2Gbps)
– Two configurable interfaces (76.8GB/s @ 6.4Gbps)
• Configurable number of Bytes
• Coherent or I/O Protection
– Allows for multiple system configurations
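The interface figures follow from pin rate times width. The sketch below reproduces them under assumptions that are not stated on the slide: the dual XDR controller drives two 32-bit channels (8 byte lanes in total), and the two configurable interfaces together provide 12 byte lanes. The function name is hypothetical.

```python
def link_bandwidth_gbytes(byte_lanes, gbits_per_pin):
    """GB/s for a link of `byte_lanes` lanes, 8 pins each, at `gbits_per_pin` Gbit/s."""
    pins = byte_lanes * 8
    return pins * gbits_per_pin / 8   # divide by 8 bits per byte


xdr = link_bandwidth_gbytes(byte_lanes=8, gbits_per_pin=3.2)         # assumed 2 x 32-bit
flexio_fast = link_bandwidth_gbytes(byte_lanes=12, gbits_per_pin=6.4)  # assumed 12 lanes
flexio_slow = link_bandwidth_gbytes(byte_lanes=12, gbits_per_pin=5.0)

print(f"XDR memory: {xdr:.1f} GB/s")
print(f"Configurable I/O at 6.4Gbps: {flexio_fast:.1f} GB/s")
print(f"Configurable I/O at 5Gbps:   {flexio_slow:.1f} GB/s")
```

Under the 12-lane assumption the same interface width accounts for both the 76.8GB/s @ 6.4Gbps figure on this slide and the 60GB/s @ 5Gbps figure quoted for the Broadband Interface Controller.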
Cell BE Processor Can Support Many Systems
 Game console systems
 Blades
 HDTV
 Home media servers
 Supercomputers
[Figure: example system configurations. A single Cell BE Processor with XDR™ memory and IOIF0/IOIF1 links; two Cell BE Processors joined by BIF, each with XDR™ memory and an IOIF link; four Cell BE Processors joined by BIF through a switch (SW), each with XDR™ memory and an IOIF link.]
Summary
 Cell ushers in a new era of leading edge processors
optimized for digital media and entertainment
 Desire for realism is driving a convergence between
supercomputing and entertainment
 New levels of performance and power efficiency
beyond what is achieved by PC processors
 Responsiveness to the human user and the network
are key drivers for Cell
 Cell will enable entirely new classes of applications,
even beyond those we contemplate today
(c) Copyright International Business Machines Corporation 2005.
All Rights Reserved. Printed in the United States April 2005.
The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both.
IBM
IBM Logo
Power Architecture
Other company, product and service names may be trademarks or service marks of others.
All information contained in this document is subject to change without notice. The products described in this document are
NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result
in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change
IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity
under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific
environments, and is presented as an illustration. The results obtained in other operating environments may vary.
While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied
upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable
for damages arising directly or indirectly from any use of the information contained in this document.
IBM Microelectronics Division
1580 Route 52, Bldg. 504
Hopewell Junction, NY 12533-6351
The IBM home page is http://www.ibm.com
The IBM Microelectronics Division home page is
http://www.chips.ibm.com