Dynamically Programmable Array Architecture
Robert Heaton
Obsidian Technology
Confidential
Mesh of Trees
[Figure: 16 PUs connected as a mesh of trees]
Busses are bidirectional
2 cycles to exchange data
Separate X and Y dimensions
Diagonal routing not directly supported
PUs are difficult to program to take advantage of the structure
Two Dimensional Mesh
[Figure: 16 PUs connected as a two-dimensional mesh]
4x4 Hierarchical Cluster
[Figure: 4x4 hierarchical cluster of 16 PUs and 5 RUs]
Simple 4x4 Cluster Wiring
[Figure: four PUs wired through a switch; labels: 6*N Wires, N, Hin1, Hout1, Hadr1, 2L-2, Joint, 1.4 M2 Pitch, Switch]
Bus width = 140u for 16-bit busses
That is a lot of wires!
Budgeted 4x4 cluster area is 1 mm2
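The wiring numbers on this slide can be cross-checked with a little arithmetic. The sketch below assumes that the "1.4" label is the metal-2 pitch in microns and that all 6*N wires run side by side at that pitch; neither assumption is stated on the slide, and the function name is only illustrative.

```python
# Rough cross-check of the cluster bus width quoted on the slide.
# Assumptions (not stated in the source): 1.4 is the M2 pitch in microns,
# and all 6*N wires are routed side by side on that one layer.

M2_PITCH_UM = 1.4   # label "1.4 M2 Pitch" from the wiring figure
BUSSES = 6          # the figure shows 6*N wires


def bus_width_um(bits_per_bus: int) -> float:
    """Width of the wiring channel for busses of the given bit width."""
    return BUSSES * bits_per_bus * M2_PITCH_UM


if __name__ == "__main__":
    wires = BUSSES * 16
    print(f"16-bit busses: {wires} wires, about {bus_width_um(16):.1f}u wide")
```

Under these assumptions, 16-bit busses need 96 wires and roughly 134u of channel width, which is in line with the 140u quoted above.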
Routing Hierarchy
[Figure: routing hierarchy; clusters of PUs, each with an RU, grouped under RU1, RU2 and RU3 routing units]
Hadr: up level till
L0adr: local address
L1adr: level 1 address
L2adr: level 2 address
L3adr: level 3 address
256 PUs
4 Levels of hierarchy
Hadr L0adr L1adr L2adr L3adr
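The 256-PU figure is consistent with four levels of four-way selection, since 4^4 = 256. The sketch below splits a flat PU index into four level addresses. It assumes 2 bits per level and treats L0adr as the least significant digit; the flat index and the digit ordering are illustrative assumptions, not something the slides define.

```python
# Hypothetical mapping between a flat PU index (0..255) and the four
# per-level addresses. The ordering (L0adr least significant) is assumed.

LEVELS = 4          # "4 Levels of hierarchy"
BITS_PER_LEVEL = 2  # assumed: each level selects one of four children


def split_address(pu_index: int) -> list[int]:
    """Return [L0adr, L1adr, L2adr, L3adr] for a flat PU index."""
    total = 1 << (LEVELS * BITS_PER_LEVEL)
    if not 0 <= pu_index < total:
        raise ValueError(f"PU index must be in 0..{total - 1}")
    mask = (1 << BITS_PER_LEVEL) - 1
    return [(pu_index >> (BITS_PER_LEVEL * lvl)) & mask for lvl in range(LEVELS)]


def join_address(level_addrs: list[int]) -> int:
    """Inverse of split_address."""
    index = 0
    for lvl, addr in enumerate(level_addrs):
        index |= addr << (BITS_PER_LEVEL * lvl)
    return index


if __name__ == "__main__":
    print(split_address(27))
```

With this layout, two PUs that share L1adr, L2adr and L3adr sit in the same lowest-level cluster and differ only in L0adr.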
Week's Investigation (9/12/97)
Investigate routing structures
Dynamic routing assignment/programming
Compromise between area and flexibility
Support for tree of trees
Not a complete story yet!
Routing Unit
[Figure: one Routing Unit (RU) connected to four Process Units (PUs)]
Full-duplex connect busses
Each PU node controls its source port via a 2-bit local or 6-bit hierarchical address
Broadcast support
Any node may listen to any other input to the cluster
Hierarchical node addressing must not clash
Routing Unit PU Port Detail
[Figure: PU port detail; the N-bit PU input is selected from port 0, port 1, port 2 or port H using select lines s0 and s1 and an AND gate; the N-bit PU output goes to the other ports; PU input address widths are labelled 6, 2 and 4]
Port numbering is clockwise and relative to each PU port
HBUS port is always at port 3
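The port selection described on this slide can be sketched as a 4:1 multiplexer per PU input. The code assumes that relative port k (for k = 0..2) means the PU sitting k+1 positions clockwise from the listener; that reading of "clockwise and relative" is an assumption, and the function and parameter names are hypothetical.

```python
# Sketch of one RU serving four PU ports. Each PU picks its input with a
# 2-bit local address; port 3 is always the hierarchical bus (HBUS).
# The clockwise-offset interpretation below is an assumption.

NUM_PU_PORTS = 4
HBUS_PORT = 3


def select_source(listener: int, local_addr: int,
                  pu_outputs: list[int], hbus_in: int) -> int:
    """Return the value seen at the input of PU `listener`."""
    if not 0 <= listener < NUM_PU_PORTS:
        raise ValueError("listener must be 0..3")
    if not 0 <= local_addr <= 3:
        raise ValueError("local address is 2 bits (0..3)")
    if local_addr == HBUS_PORT:
        return hbus_in
    source = (listener + 1 + local_addr) % NUM_PU_PORTS
    return pu_outputs[source]


if __name__ == "__main__":
    outputs = [10, 11, 12, 13]   # current output value of PU 0..3
    print(select_source(0, 0, outputs, hbus_in=99))
    print(select_source(0, 3, outputs, hbus_in=99))
```

Because every listener makes its own selection, several PUs can select the same source at once, which is one way to read the broadcast support mentioned for the routing unit.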
PU Overview
Simple data path functionality
Primitive control options
Wide instructions control data path function and operand routing
Conditions may be inverted for “repeat until” or “branch if” control
Very primitive address arithmetic
32 or fewer instructions per program
N Bit Functional Unit
[Figure: N-bit functional unit data path; labels: A, Constbit, mux0, mux1, LSin, RSin, SFTCTL, Bit Shift, ALUCTL, ALU/MULT, Cin, Cout, Carry Logic, DFF, mux2, F]
Logic functions: OR, XOR, AND, 0, 1
Arithmetic: add, subtract, multiply
Shifts: single bit left and right
Conditional detection: 0, -1, <0, >0
More optimization needed
Routing issues need more work
N Bit Functional Unit (V2)
[Figure: V2 functional unit data path; labels: Operands, N bit RAM (two instances), mux0, mux1, LSin, RSin, SFTCTL, B Shift, Multiply Sequencer, ALUCTL, ALU, Cin, Cout, Carry Logic, DFF, mux2, Out]
Logic functions: OR, XOR, AND, 0, 1
Arithmetic: add, subtract
Shifts: right and left shifts
Conditional detection: 0, <0, >0, OF
Memory mapped RAM access to operands
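A behavioural sketch of the V2 functional unit may help make the function list concrete. It covers only the OR, XOR, AND, add and subtract operations and the four listed conditions; the constants, shifts and the multiply sequencer are left out. It assumes two's-complement arithmetic, and the function name, operation names and flag names are illustrative only.

```python
# Behavioural sketch of an N-bit functional unit (two's complement assumed).
# Returns the N-bit result plus the condition flags listed on the slide.


def alu(op: str, a: int, b: int, n: int = 16) -> tuple[int, dict[str, bool]]:
    mask = (1 << n) - 1
    sign = 1 << (n - 1)
    a &= mask
    b &= mask
    overflow = False
    if op == "or":
        result = a | b
    elif op == "xor":
        result = a ^ b
    elif op == "and":
        result = a & b
    elif op == "add":
        result = (a + b) & mask
        # Overflow when both operands differ in sign from the result.
        overflow = bool((a ^ result) & (b ^ result) & sign)
    elif op == "sub":
        result = (a - b) & mask
        # Overflow when operand signs differ and the result sign differs from a.
        overflow = bool((a ^ b) & (a ^ result) & sign)
    else:
        raise ValueError(f"unsupported op: {op}")
    zero = result == 0
    negative = bool(result & sign)
    flags = {"zero": zero, "<0": negative,
             ">0": not zero and not negative, "OF": overflow}
    return result, flags


if __name__ == "__main__":
    print(alu("add", 0x7FFF, 1))
```

Adding 1 to the largest positive 16-bit value wraps to the most negative value, so the sketch reports both the <0 and OF conditions for that case.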
Instruction Fields
Field          Bits  Comment
ALU_CTL        5     Control of basic ALU functions
SHIFT_CTL      2     Control of the operand shift
MUX_CTL        3     Control operand muxes
BRANCH_ADR     2     Next address if condition true
COND_MSK       5     Condition mask
COND_FLD       5     Condition field
EXT_COND_SRC   2     Select source for external condition inputs
HEIR_ADDR      2     Hierarchical routing level address
L0_ADDR        2     Level 0 source address
L1_ADDR        2     Level 1 source address
L2_ADDR        2     Level 2 source address
L3_ADDR        2     Level 3 source address
?? + XN bits per context
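The field list on this slide can be turned into a simple pack/unpack routine. The field names and widths below come from the slide; the packing order (first field in the most significant bits) and the helper names are assumptions made for illustration.

```python
# Pack and unpack the instruction fields listed on the slide.
# Field order within the word is assumed, not given by the source.

FIELDS = [
    ("ALU_CTL", 5), ("SHIFT_CTL", 2), ("MUX_CTL", 3), ("BRANCH_ADR", 2),
    ("COND_MSK", 5), ("COND_FLD", 5), ("EXT_COND_SRC", 2), ("HEIR_ADDR", 2),
    ("L0_ADDR", 2), ("L1_ADDR", 2), ("L2_ADDR", 2), ("L3_ADDR", 2),
]
TOTAL_BITS = sum(width for _, width in FIELDS)


def pack(values: dict[str, int]) -> int:
    """Build an instruction word; missing fields default to 0."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> dict[str, int]:
    """Split an instruction word back into its fields."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out


if __name__ == "__main__":
    print("total field bits:", TOTAL_BITS)
    print(unpack(pack({"ALU_CTL": 3, "L0_ADDR": 2})))
```

The listed widths sum to 34 bits, so the field set as written does not fit a 32-bit word without trimming or sharing some fields.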
PU Instruction Types
32 Bits
Type          Fields
Data Process  00 | ALU_CTL, SFT_CTL, MUX_CTL, ROUTE_CTL
Move          01 | OP_SEL | R/W | Operand_Value
Multiply      100 | OP_SEL | Options | Immediate Operand
Attention     101 | Condition | Branch_Adr | Options | Flag
Branch        110 | Condition | Branch_Adr | Options | Link
ROUTE_CTL Field: Hadr L0adr L1adr L2adr L3adr
Condition Field (15 Bits): Invert +ve -ve zero OF X1 X0 | Condition Mask | Ext’ Source Sel
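The type codes on this slide (00, 01, 100, 101, 110) form a prefix code, so an instruction's type can be decoded from its leading bits. The sketch below assumes the code occupies the most significant bits of the 32-bit word and that 111 is unassigned; the slide does not state either point, and the function name is hypothetical.

```python
# Decode the instruction type from the leading bits of a 32-bit word.
# Placement of the type code in the top bits is an assumption.

THREE_BIT_TYPES = {0b100: "Multiply", 0b101: "Attention", 0b110: "Branch"}


def instruction_type(word: int) -> str:
    if not 0 <= word < (1 << 32):
        raise ValueError("expected a 32-bit instruction word")
    top3 = word >> 29
    top2 = top3 >> 1
    if top2 == 0b00:
        return "Data Process"
    if top2 == 0b01:
        return "Move"
    return THREE_BIT_TYPES.get(top3, "unassigned")


if __name__ == "__main__":
    for w in (0x00000000, 0x40000000, 0x80000000, 0xA0000000, 0xC0000000):
        print(f"{w:#010x} -> {instruction_type(w)}")
```

Giving the two shortest codes to Data Process and Move leaves those types one extra bit for their own fields.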
Condition Field
Condition Field (15 Bits): Invert +ve -ve zero OF X1 X0 | Condition Mask | Ext’ Source Sel
X[1:0] are external condition bits and may be sourced from:
Operand bits
Global synchronization bus
Nearest-neighbour condition outputs
Condition Mask is ANDed with flag bits
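One plausible reading of the condition logic is sketched below: the mask is ANDed with the flag bits, the masked bits are OR-reduced to a single condition, and the Invert bit flips the result. The OR reduction and the six-flag layout are assumptions; the slide only states that the mask is ANDed with the flags and that conditions may be inverted.

```python
# Sketch of condition evaluation. Flag names follow the slide; the
# OR-reduction of the masked flags is an assumed interpretation.

FLAG_NAMES = ("+ve", "-ve", "zero", "OF", "X1", "X0")


def condition_met(flags: dict[str, bool], mask: dict[str, bool],
                  invert: bool = False) -> bool:
    """True when any masked flag is set, optionally inverted."""
    hit = any(flags.get(name, False) and mask.get(name, False)
              for name in FLAG_NAMES)
    return hit != invert


if __name__ == "__main__":
    flags = {"zero": True}
    print(condition_met(flags, {"zero": True}))               # branch if zero
    print(condition_met(flags, {"zero": True}, invert=True))  # repeat until zero
```

With Invert clear this behaves like a "branch if" test; with Invert set, a loop-back branch keeps being taken until the masked flag becomes true, which gives the "repeat until" form.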
Static Program
[Figure: Data Process (Adr +1) followed by Branch (Always)]
PU never changes function
Branch is set to always true
Just two instructions
More Typical Program
Open Issues
PU Data path width
Complexity of shift operations
RU Trunking
Number of contexts per PU
Flexible context RAM partitioning
Improve PU synchronization
Shifter Instructions
Design Tools
PU Assembler
Architecture mapping
Global resource allocation
Conditional N Bit PU Cell
[Figure: conditional N-bit PU cell; labels: Input, Port address, A, B, Constbit, mux0, mux1, mux2, LSin, RSin, EXT[1:0], ALUCTL, SFTCTL, ColSel, RAM, Address Logic, Bit Shift, ALU/MULT, Branch, Condition Logic, Carry Logic, Cin, Cout, DFF, F, Out]
Commercial Viability
5x performance improvement over conventional solutions (mix of cost & power)
Conceptually simple
Clearly defined target applications
Simple systems connections
Scalable
Support hardware & software standards
Conditional N Bit DPA Cell
[Figure: conditional N-bit DPA cell surrounded by Routing Matrix blocks; labels: A, B, Constbit, mux0, mux1, mux2, LSin, RSin, EXT[1:0], ALUCTL, SFTCTL, ColSel, RAM, Address Logic, Bit Shift, ALU, Branch, Condition Logic, Carry Logic, Cin, Cout, DFF, F]
4 Bit Cell: 180 gates, 112 bits RAM
N Bit Wide DPA
[Figure: N-bit-wide DPA built from repeated slices; each slice shows Program Storage, M Plane FU Decode, RAM, StatusReg, an N bit wide FU and Condition Logic, with connections labelled A, B and C]
N Bit Wide PU Block
Instruction Format: OP Code | Source A | Source B | Shift Op | PipeBus | Status Msk
[Figure: PU block; labels: Arbit (two instances), Local RAM, N Bit wide Shift, Inst RAM, Addr Logic, I Decode, State, StatusReg, N bit wide ALU, Condition Logic, PipeBus, HierBus, BusW, BusX, A, B]
NOTES/QUESTIONS
- Inst has no const, but has offsets.
- Inst RAM can be small. 64 words?
- Note: counter takes 3 instructions.
- How much subroutine support? None?
- Simplified 16 bit or full 32 bit instructions?
- 2 or 4 local area busses?
- Synchronization issue: master states accessible, cond mask use.
- Option to break or combine N bit DP elements?
- Resource pool on busses? E.g. MULT?
- Approx. size of 32 bit FU 800u x 500u?
- If so, a 16x8 processor array is possible.
- I.e. 128 processors at 100MHz = 12800MIPS
- Turn off till global state instruction for power reduction
- Handling of interrupts (if at all)
- Handle global signal interrupts how?
- Multiple bit wide segmentation through masks? E.g. 2 counters in one PU?
Potential Configuration
128 32 Bit “Pico” Process Units
12800MIPS @ 100MHz
80mm2 in 0.35u CMOS
Concept of hierarchical hardware scope
Very fast streaming operations
Simple PU programming model
Applications:
Video processing
LAN Routing
DSP Fast Prototyping
[Figure: 16 x 8 PU array with Controller, Global RAM, MUX/DMA/FIFO and RAMBUS Interface; one connection is labelled 256]
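The headline numbers for this configuration can be reproduced with simple arithmetic. The sketch assumes one instruction per PU per clock cycle for the MIPS figure and reuses the tentative 800u x 500u size estimate for a 32-bit unit from the design notes; both are assumptions, and the function names are illustrative.

```python
# Back-of-envelope check of the potential configuration.
# Assumes one instruction per PU per cycle and the tentative
# 800u x 500u per-unit size estimate from the design notes.

PU_COUNT = 16 * 8        # 16 x 8 PU array
CLOCK_MHZ = 100
PU_WIDTH_UM = 800
PU_HEIGHT_UM = 500
DIE_AREA_MM2 = 80.0      # "80mm2 in 0.35u CMOS"


def peak_mips() -> int:
    return PU_COUNT * CLOCK_MHZ


def array_area_mm2() -> float:
    pu_area_mm2 = PU_WIDTH_UM * PU_HEIGHT_UM / 1e6
    return PU_COUNT * pu_area_mm2


if __name__ == "__main__":
    print("PUs:", PU_COUNT)
    print("peak MIPS:", peak_mips())
    print(f"PU array area: {array_area_mm2():.1f} mm2")
    print(f"left for everything else: {DIE_AREA_MM2 - array_area_mm2():.1f} mm2")
```

On these assumptions the PU array alone takes about 51 mm2, leaving roughly 29 mm2 of the 80 mm2 for the controller, global RAM, interface logic and routing.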
PU Program Environment
Operands: BusW, BusX, Accumulator, HierBus, PipeBus, Local RAM.
Use: a PU typically runs a small program
– May be as few as two instructions
– 64 words of code maximum
Instruction types:
Arithmetic, logical
Data moving
Interrupt
Function             Instructions
Arithmetic           1
Counter              1-2
Mux                  1
Multiply Accumulate  3
FIFO Stage           3
Multiport Register   1
Shift Register       2
Architecture Figures of Merit
Average density vs application specific cells
Speed of applications vs hardwired logic
Percentage reuse
Next Steps
VHDL Modeling of Architecture
Primitive assembler tools for PUs
Selection, coding and simulation of applications
Architecture tuning
Layout and verification of complete DPA
Design Tools
Tanner:
Schematic entry, logic simulation, custom layout, layout verification.
Circuit Simulation.
PC & Sun platforms.
MOSIS Libraries.
Mentor Graphics:
VHDL compilation and simulation.
Basic FU Routing
[Figure: basic routing between 12 FUs]