A Design Space Exploration of Grid Processor Architectures

R. Nagarajan, K. Sankaralingam, D. Burger, and S. W. Keckler
The University of Texas at Austin
MICRO, 2001
Presented by Jie Xiao
April 2, 2008
Outline
 Motivation
 Block-Atomic Execution Model
 GPA Implementation
 Evaluation
 Conclusion
Motivation
 Performance improvement came from:
   Clock rate ↑
   ILP ↑
 Existing problems:
   ILP improvement is small
   Pipeline depth limits clock rate increase
   Wire delay slows down IPC
 Goal:
   Clock rate ↑
   ILP ↑
   Wire delay ↓
Block-Atomic Execution Model
 An atomic unit / group / hyperblock
   A block of instructions is the atomic unit of fetch/schedule/execute/commit
   Groupings are done by the compiler
 Three types of data for each group
   Group inputs
     move: reg file -> ALU
   Group temporaries
     point-to-point operand delivery -> reg file bandwidth ↓
     data-driven, as in a dataflow machine
   Group outputs
 Example
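The three data categories above can be sketched as a small compiler-style classification pass over a block. This is a hypothetical illustration, not the paper's algorithm: the function name `classify_values`, the register names, and the `live_out` set are all assumptions, and tracking liveness at register (rather than per-value) granularity is a simplification.

```python
def classify_values(block, live_out):
    """block: list of (dest, srcs) instructions in program order.
    live_out: registers the compiler knows are read after the block."""
    defined = set()   # registers produced inside the block
    inputs = set()    # registers read before any in-block definition
    for dest, srcs in block:
        for s in srcs:
            if s not in defined:
                inputs.add(s)        # group input: moved in from the reg file
        defined.add(dest)
    outputs = defined & live_out     # group outputs: written back on commit
    temporaries = defined - live_out # delivered point-to-point, never stored
    return inputs, temporaries, outputs

# r1 = r2 + r3; r4 = r1 * r5; r1 = r4 - r2; only r1 is read afterwards
block = [("r1", ("r2", "r3")), ("r4", ("r1", "r5")), ("r1", ("r4", "r2"))]
print(classify_values(block, live_out={"r1"}))
# inputs {r2, r3, r5}, temporaries {r4}, outputs {r1}
```

Only the inputs and outputs touch the register file; the temporary `r4` is exactly the traffic that point-to-point operand delivery removes.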
Block-Atomic Execution Model
 Advantages
   No centralized, associative issue window
   No reg renaming table
   Fewer reg file reads and writes
   No broadcasting
   Reduced conventional wire and communication delay
   Compiler-controlled physical layout ensures that the critical path is scheduled along the shortest physical path
GPA Implementation
 Each node contains:
   Instruction buffer
   Execution unit
GPA Implementation
 Virtual grid – uses frames
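The node structure above can be sketched as follows: an instruction buffer partitioned into frames (so several blocks can be resident at once) plus an execution unit that fires an instruction as soon as its operands have arrived. This is a hypothetical sketch; the class name, field names, and the frame/slot sizes are illustrative assumptions, not the paper's design.

```python
class Node:
    """One grid node: a framed instruction buffer plus an execution unit."""

    def __init__(self, frames=4, slots_per_frame=8):
        # buffer[frame][slot] holds one mapped instruction (or None)
        self.buffer = [[None] * slots_per_frame for _ in range(frames)]

    def map_instruction(self, frame, slot, opcode, n_operands):
        """The compiler-directed mapper places an instruction in a slot."""
        self.buffer[frame][slot] = {"op": opcode, "need": n_operands, "have": []}

    def deliver_operand(self, frame, slot, value):
        """Point-to-point operand delivery from a producer node."""
        inst = self.buffer[frame][slot]
        inst["have"].append(value)
        if len(inst["have"]) == inst["need"]:
            return self.execute(inst)  # data-driven issue: fire when ready
        return None                    # still waiting on operands

    def execute(self, inst):
        a, b = inst["have"]
        return {"add": a + b, "mul": a * b}[inst["op"]]

node = Node()
node.map_instruction(0, 0, "add", 2)
node.deliver_operand(0, 0, 3)          # first operand: instruction waits
print(node.deliver_operand(0, 0, 4))   # second operand arrives -> fires -> 7
```

The frame index is what virtualizes the grid: the same physical node holds slots from multiple in-flight blocks.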
GPA Implementation
 Block stitching
   Overlaps fetch, map, and execution
   Speculative – using a block-level predictor
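Block stitching can be sketched as follows: while one block executes, a block-level predictor names the next block, which is fetched and mapped speculatively; a wrong prediction squashes the mis-mapped block. This is a hypothetical sketch under strong simplifications — `stitch`, the trivial dictionary predictor, and the sequential "overlap" are all illustrative assumptions.

```python
def stitch(blocks, start, predictor, max_blocks=8):
    """blocks: name -> function returning the actual successor (or None).
    predictor: name -> predicted successor block.
    Returns the executed trace and the misprediction count."""
    trace, mispredicts = [], 0
    current = start
    while current is not None and len(trace) < max_blocks:
        trace.append(current)
        predicted = predictor.get(current)  # fetch & map the next block...
        actual = blocks[current]()          # ...overlapped with executing this one
        if predicted != actual:
            mispredicts += 1                # squash the speculatively mapped block
        current = actual
    return trace, mispredicts

# Two-block loop that exits after the third visit to "A":
state = {"n": 0}
def block_a():
    state["n"] += 1
    return "B" if state["n"] < 3 else None
def block_b():
    return "A"

trace, mispredicts = stitch({"A": block_a, "B": block_b}, "A", {"A": "B", "B": "A"})
print(trace, mispredicts)   # ['A', 'B', 'A', 'B', 'A'] 1
```

The single misprediction is the loop exit, which is exactly where a block-level predictor pays its squash cost.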
GPA Implementation
 Hyperblock control
   Blocks allow simple internal control flow
   Execute-all for predication within a block
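The execute-all strategy can be illustrated with an if-converted branch: both arms run unconditionally as soon as their operands arrive, and the predicate only selects which result survives at the end. The function and variable names below are illustrative assumptions.

```python
def execute_all(x, y):
    """Execute-all predication: run both arms, let the predicate pick."""
    then_result = x + y    # both arms execute unconditionally...
    else_result = x - y    # ...possibly finishing before the predicate resolves
    predicate = x > y      # the predicate arrives whenever it arrives
    return then_result if predicate else else_result  # select, don't gate

print(execute_all(5, 2))   # predicate true  -> 7
print(execute_all(1, 2))   # predicate false -> -1
```

This is the trade-off the analysis slide names: no instruction stalls waiting on the predicate, but the losing arm's work is wasted power.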
Evaluation
 3 SPECint2000, 3 SPECfp2000, and 3 Mediabench benchmarks
 Compiled using the Trimaran toolset
 Simulated with the Trimaran simulator
 Load instructions are placed close to the D-caches
Evaluation
 Superscalar
   5-stage pipeline, 8-wide
   512-entry instruction window
   0-cycle router and wire delay!
 GPA
   8x8 grid
   0.25-cycle router + 0.25-cycle wire delay per hop
   32 instruction slots at every node
 Both configurations
   Alpha 21264 functional unit latencies
   L1: 3 cycles, L2: 13 cycles, main memory: 62 cycles
Evaluation
 White portion: perfect memory and perfect branch prediction
Evaluation
 Delays
   Number of hops – mapping (Table 3)
   Wire delays
   Router delay at each hop
   Contention – not a big deal (Table 4)
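The first three delay components combine into a simple back-of-the-envelope model: hop count times per-hop router and wire cost, using the 0.25-cycle figures from the evaluation setup. Treating routes as Manhattan paths (and ignoring contention) is an assumption for illustration, not the simulator's model.

```python
ROUTER_DELAY = 0.25  # cycles per hop (from the evaluation configuration)
WIRE_DELAY = 0.25    # cycles per hop

def operand_delay(src, dst):
    """Delivery delay between two grid nodes, assuming Manhattan routing."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return hops * (ROUTER_DELAY + WIRE_DELAY)

# Corner-to-corner on the 8x8 grid: 14 hops * 0.5 cycles/hop
print(operand_delay((0, 0), (7, 7)))   # 7.0 cycles
```

This is why the mapping (Table 3) matters: a compiler that places dependent instructions on adjacent nodes pays 0.5 cycles per operand instead of several.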
Analysis
 Grid network design
   Pre-reserving network channels
   Express channels
 Predication strategy – execute-all
   Fewer instructions wait for the predicate to be calculated
   Less efficient use of power
Comparison
 GPA vs. Tagged-Token Dataflow Architecture
   Both are data-driven
   GPA – conventional programming interface
   GPA – compiler-directed instruction mapping
   GPA – reduced token-matching complexity
 GPA is a hybrid of VLIW and superscalar
   Statically scheduled
   Dynamically issued
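The "statically scheduled, dynamically issued" split can be made concrete in a small sketch: the compiler fixes where each instruction lives on the grid (static placement, VLIW-like), but the order in which instructions fire is decided at runtime by operand arrival (dynamic issue, dataflow-like). All names and the event format below are illustrative assumptions.

```python
# Static half: the compiler assigns each instruction a fixed grid coordinate.
placement = {"i0": (0, 0), "i1": (0, 1), "i2": (1, 0)}

def dynamic_issue(arrival_events, needs):
    """arrival_events: list of (instruction, operand) in arrival order.
    needs: instruction -> operand count. Returns the firing order,
    which follows the data, not the static placement."""
    have = {inst: 0 for inst in needs}
    fired = []
    for inst, _operand in arrival_events:
        have[inst] += 1
        if have[inst] == needs[inst]:
            fired.append(inst)  # issues the moment its last operand arrives
    return fired

# i2's operands happen to arrive first, so it fires before i0 and i1:
events = [("i2", "a"), ("i2", "b"), ("i0", "a"),
          ("i1", "a"), ("i0", "b"), ("i1", "b")]
print(dynamic_issue(events, {"i0": 2, "i1": 2, "i2": 2}))  # ['i2', 'i0', 'i1']
```

Placement never changes at runtime, yet the issue order adapts to data arrival — the hybrid the comparison slide describes.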
Conclusion
 Strengths
   No centralized, associative issue window
   No reg renaming table
   Fewer reg file reads and writes
   No broadcasting
   Reduced conventional wire and communication delay
   Compiler-controlled physical layout ensures that the critical path is scheduled along the shortest physical path
 Challenges
   Wire delay and router delay
   Complex frame management and block stitching