A Design Space Exploration of Grid Processor Architectures
A Design Space Exploration of
Grid Processor Architectures
R. Nagarajan, K. Sankaralingam, D. Burger, and S. W. Keckler
The University of Texas at Austin
MICRO, 2001
Presented by Jie Xiao
April 2, 2008
Outline
Motivation
Block-Atomic Execution Model
GPA Implementation
Evaluation
Conclusion
Motivation
Performance improvement came from:
Clock rate ↑
ILP ↑
Existing problems:
ILP improvement is small
Pipeline depth limits clock rate increase
Wire delay slows down IPC
Goal:
Clock rate ↑
ILP ↑
Wire delay ↓
Block-Atomic Execution Model
An atomic unit / group / hyperblock
A block of instructions is an atomic unit of
fetch/schedule/execute/commit
Groupings are done by the compiler
Three types of data for each group
Group inputs
move: reg file -> ALU
Group temporaries
point-to-point operand delivery -> reg file bandwidth↓
data-driven – dataflow machine
Group outputs
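The three value categories above can be sketched as a small classifier (a minimal illustration, not the paper's compiler; the function name and register sets are hypothetical):

```python
def classify_block_values(defs, uses, live_out):
    """Split the registers touched by a block into the three categories
    of the block-atomic model.
    defs: registers written inside the block
    uses: registers read inside the block
    live_out: registers still live after the block"""
    inputs = uses - defs           # group inputs: injected via move, reg file -> ALU
    outputs = defs & live_out      # group outputs: written back to the reg file at commit
    temporaries = defs - live_out  # group temporaries: forwarded point-to-point,
                                   # never touching the reg file
    return inputs, temporaries, outputs
```

With `defs = {"r3", "r4", "r5"}`, `uses = {"r1", "r2", "r3", "r4"}`, and `live_out = {"r5"}`, only `r5` is a group output; `r3` and `r4` stay inside the grid as temporaries, which is where the register file bandwidth reduction comes from.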
Example
Block-Atomic Execution Model
Advantages
No centralized, associative issue window
No reg renaming table
Fewer reg file reads and writes
No broadcasting
Reduced conventional wire and communication
delay
Compiler-controlled physical layout ensures that
the critical path is scheduled along the shortest
physical path
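The last advantage can be illustrated with a toy placement (a sketch of the idea, not the paper's scheduler): a dependence chain is mapped onto physically adjacent grid nodes, so every producer-to-consumer edge is one hop.

```python
def place_chain(chain):
    """Place a dependence chain of instructions straight down one
    grid column, so each producer->consumer edge is exactly one hop."""
    return {instr: (row, 0) for row, instr in enumerate(chain)}

def hops(placement, producer, consumer):
    """Operand travel distance: Manhattan distance on the grid."""
    (r1, c1), (r2, c2) = placement[producer], placement[consumer]
    return abs(r1 - r2) + abs(c1 - c2)
```

For `place_chain(["ld", "add", "mul"])`, each adjacent pair in the chain is exactly one hop apart, i.e. the critical path incurs the minimum wire and router delay.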
GPA Implementation
Each node contains:
Instruction buffer
Execution unit
GPA Implementation
Virtual Grid – use frames
GPA Implementation
Block stitching
Overlap fetch, map and execution
Speculative – using a block-level
predictor
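A toy version of such a block-level predictor (illustrative only; the class and its policy are assumptions, not the paper's design) predicts the next hyperblock from the last observed successor, so its fetch and map can overlap the current block's execution:

```python
class BlockPredictor:
    """Last-successor block predictor: remember which block followed
    each block last time, and speculate on that next time."""

    def __init__(self):
        self.last_successor = {}

    def predict(self, block):
        # None means no history yet -> no speculative stitching
        return self.last_successor.get(block)

    def update(self, block, actual_next):
        # train on the actually committed successor
        self.last_successor[block] = actual_next
```

A misprediction simply means the speculatively mapped block is squashed, just as a branch mispredict flushes a conventional pipeline.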
GPA Implementation
Hyperblock control
Blocks allow simple internal control flow
Execute-all for predication within a block
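The execute-all strategy can be sketched in a few lines (names and values are illustrative): both predicate paths run unconditionally, and the predicate only selects which result commits.

```python
def hyperblock_execute_all(a, b):
    p = a > b             # the predicate is computed...
    t = a - b             # ...but both arms execute regardless,
    f = b - a             # so no instruction stalls waiting on p
    return t if p else f  # p only selects the committed result
```

This is why fewer instructions wait on the predicate, and also why power is spent on results that are ultimately discarded.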
Evaluation
3 SPECint2000, 3 SPECfp2000, and 3
Mediabench benchmarks
Compiled using the Trimaran toolset
Trimaran simulator
Load instructions are placed close to Dcaches
Evaluation
Superscalar:
5-stage pipeline, 8-wide
0-cycle router and wire delay!
512-entry instruction window
GPA:
8x8 grid
0.25-cycle router + 0.25-cycle wire delay
32 instruction slots at every node
Both:
Alpha 21264 functional unit latencies
L1: 3 cycles, L2: 13 cycles, main memory: 62 cycles
Evaluation
White portion: perfect memory and perfect branch
prediction
Evaluation
Delays
Number of hops – mapping (Table 3)
Wire delays
Router delay at each hop
Contentions – not a big deal (Table 4)
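The per-hop cost in the evaluation setup (0.25 cycles of router delay plus 0.25 cycles of wire delay) makes the operand delivery delay easy to estimate; a back-of-the-envelope sketch:

```python
ROUTER_DELAY = 0.25  # cycles per hop, from the evaluation setup
WIRE_DELAY = 0.25    # cycles per hop

def operand_delay(hops):
    """Delay for an operand to travel a given number of grid hops."""
    return hops * (ROUTER_DELAY + WIRE_DELAY)
```

Forwarding an operand three hops across the grid therefore costs `operand_delay(3)`, i.e. 1.5 cycles, which is why the mapping (number of hops) dominates the delay breakdown.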
Analysis
Grid network design
Pre-reserving network channels
Express channels
Predication strategy -- Execute-all
Fewer instructions wait for the predicate to
be calculated
Less efficient use of power
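The express-channel idea under grid network design can be sketched as follows (the skip distance and delay split are assumptions for illustration, not figures from the paper): an express link spans several nodes' worth of wire but traverses only one router.

```python
def path_delay(hops, router=0.25, wire=0.25):
    """Baseline: one router traversal per hop."""
    return hops * (router + wire)

def path_delay_express(hops, skip=4, router=0.25, wire=0.25):
    """Express channels: each express link covers `skip` hops of wire
    but pays only a single router delay; leftover hops go hop-by-hop."""
    express_links, rest = divmod(hops, skip)
    return express_links * (router + skip * wire) + path_delay(rest, router, wire)
```

For an 8-hop path this drops the delay from 4.0 cycles to 2.5 cycles, since six of the eight router traversals are skipped.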
Comparison
GPA vs. Tagged-Token Dataflow Arch
Both are data-driven
GPA – conventional programming interface
GPA – compiler instruction mapping
GPA – reduces the complexity of token
matching
GPA is a hybrid of VLIW and superscalar
Statically scheduled (like VLIW)
Dynamically issued (like superscalar)
Conclusion
Strength
No centralized, associative issue window
No reg renaming table
Fewer reg file reads and writes
No broadcasting
Reduced conventional wire and communication delay
Compiler-controlled physical layout ensures that the critical
path is scheduled along the shortest physical path
Challenges
Wire delay and router delay
Complex frame management and block stitching