Transcript Slide 1

Closely-Coupled
Timing-Directed Partitioning
in HAsim
Michael Pellauer†
[email protected]
Murali Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡
†MIT, CS and AI Lab, Computation Structures Group
‡Intel Corporation, VSSAD Group
To Appear In: ISPASS 2008
Motivation
We want to simulate target platforms quickly
We also want to construct simulators quickly
Partitioned simulators are a known technique from traditional performance models:
[Figure: the Timing Partition (micro-architecture, resource contention, dependencies) interacts with the Functional Partition (ISA, off-chip communication)]
• Simplifies the timing model
• Amortizes functional model design effort over many timing models
• The Functional Partition can be extremely FPGA-optimized
Different Partitioning Schemes
As categorized by Mauer, Hill, and Wood (source: [MAUER 2002], ACM SIGMETRICS)
We believe that a timing-directed solution, with both partitions on the FPGA, will ultimately lead to the best performance
Functional Partition in Software: Asim
• Get Instruction (at a given Address)
• Get Dependencies
• Get Instruction Results
• Read Memory*
• Speculatively Write Memory* (locally visible)
• Commit or Abort instruction
• Write Memory* (globally visible)
* Optional, depending on instruction type
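As a rough illustration, a minimal C++ sketch of this operation interface might look like the following (names and types are stand-ins, not Asim's or HAsim's actual API):

    #include <cstdint>

    using Addr  = uint64_t;
    using Token = uint32_t;   // handle naming one in-flight instruction

    // Illustrative stand-in for the functional-partition operations listed above.
    struct FunctionalPartition {
        virtual Token getInstruction(Addr pc) = 0;               // Get Instruction (at a given Address)
        virtual void  getDependencies(Token t) = 0;              // Get Dependencies
        virtual void  getResults(Token t) = 0;                   // Get Instruction Results
        virtual void  readMemory(Token t) = 0;                   // Read Memory* (loads)
        virtual void  speculativelyWriteMemory(Token t) = 0;     // Speculatively Write Memory* (locally visible)
        virtual void  commitOrAbort(Token t, bool commit) = 0;   // Commit or Abort instruction
        virtual void  writeMemory(Token t) = 0;                  // Write Memory* (globally visible)
        virtual ~FunctionalPartition() = default;
    };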
Execution in Phases
[Figure: several instructions from a sequence flowing through the phases Fetch (F), Decode (D), Execute (X), Read memory (R), Commit (C) or Abort (A), and Write memory (W), overlapped in various orders]
The Emer Assertion: all data dependencies can be represented via these phases
Detailed Example:
3 Different Timing Models
Executing the same instruction sequence:
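As a hedged illustration of how a timing model drives that sequence, here is a C++ sketch (hypothetical helper names, building on the FunctionalPartition interface sketched above) of the simplest of the three: an unpipelined model that walks each instruction through every operation within one model cycle.

    Addr nextPC(Token t);   // hypothetical helper: next PC from the functional execution result

    // An unpipelined timing model performs every phase of one instruction within a
    // single model cycle, so its CPI is 1 by construction. For brevity this sketch
    // invokes the optional memory operations unconditionally; a real model would
    // issue them only for loads and stores.
    void simulateOneModelCycle(FunctionalPartition& fp, Addr& pc) {
        Token t = fp.getInstruction(pc);       // Fetch
        fp.getDependencies(t);                 // Decode
        fp.getResults(t);                      // Execute
        fp.readMemory(t);                      // loads
        fp.speculativelyWriteMemory(t);        // stores, locally visible
        fp.commitOrAbort(t, /*commit=*/true);  // this model never mis-speculates
        fp.writeMemory(t);                     // stores, globally visible
        pc = nextPC(t);
    }

A pipelined or out-of-order timing model issues the same operations, but overlaps them across instructions and may call commitOrAbort with commit = false after a misprediction.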
Functional Partition in Hardware?
Requirements:
• Support these operations in hardware
• Allow for out-of-order execution, speculation, rollback
Challenges:
• Minimize operation execution times
• Pipeline wherever possible
• Tradeoff between BRAM and multiport RAMs
• Race conditions due to extreme parallelism
Functional Partition As Pipeline
[Figure: the Timing Model drives a Functional Partition organized as a pipeline of stages — Token Gen, Fetch, Decode, Execute, Mem, Local Commit, Global Commit — backed by Register State (RegFile) and Memory State]
Conveys concept well, but poor performance
Implementation:
Large Scoreboards in BRAM
• Series of tables in BRAM store information about each in-flight instruction
• Tables are indexed by "token", which is also used by the timing partition to refer to each instruction
• New operation "getToken" allocates a space in the tables
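A rough C++ analogue of those token-indexed tables (sizes and field names are assumptions; on the FPGA each table is a BRAM, and the real allocator is not a linear scan):

    #include <array>
    #include <cstdint>

    using Token = uint32_t;                  // same alias as in the earlier sketch
    constexpr size_t kMaxInFlight = 256;     // assumption: maximum live tokens

    // One table per piece of per-instruction state, all indexed by token.
    struct ScoreboardTables {
        std::array<uint64_t, kMaxInFlight> fetchPC{};      // filled by getInstruction
        std::array<uint32_t, kMaxInFlight> instruction{};  // fetched encoding
        std::array<uint8_t,  kMaxInFlight> destReg{};      // filled by getDependencies
        std::array<uint64_t, kMaxInFlight> result{};       // filled by getResults
        std::array<bool,     kMaxInFlight> busy{};         // is this token allocated?

        // "getToken": allocate a slot; the timing partition then uses this same
        // token to name the instruction in every later operation.
        Token getToken() {
            for (Token t = 0; t < kMaxInFlight; ++t)
                if (!busy[t]) { busy[t] = true; return t; }
            return kMaxInFlight;             // no free slot; a real design would stall
        }

        void freeToken(Token t) { busy[t] = false; }   // on commit or abort
    };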
Implementing the Operations
See paper for details (also extra slides)
Assessment:
Three Timing Models
• Unpipelined Target
• 5-Stage Pipeline
• MIPS R10K-like out-of-order superscalar
Assessment:
Target Performance
[Chart: Target Processor CPI — model cycles per instruction (y-axis, 0 to 3.5) for the Unpipelined, 5-Stage, and Out-of-Order targets on the benchmarks median, multiply, qsort, towers, vvadd, and their average]
Targets have idealized memory hierarchy
Assessment:
Simulator Performance
[Chart: Simulation Rate — FPGA cycles per model cycle (FMR, y-axis 0 to 45) for the Unpipelined, 5-Stage, and Out-of-Order models on the benchmarks median, multiply, qsort, towers, vvadd, and their average]
Some correspondence between target and
functional partition is very helpful
Assessment:
Reuse and Physical Stats
Where is functionality implemented:
[Table: for each design (Unpipelined, 5-Stage, Out-of-Order), where each structure is implemented — IMem, Program Counter, Branch Predictor, Scoreboard/ROB, Reg File, Maptable/Freelist, ALU, DMem, Store Buffer, Snapshots/Rollback; cells show which partition provides the structure (many are provided by the reused Functional Partition), or N/A when the design does not use it]
FPGA usage:
                         Unpipelined    5-Stage        Out-of-Order
FPGA Slices              6,599 (20%)    9,220 (28%)    22,873 (69%)
Block RAMs               18 (5%)        25 (7%)        25 (7%)
Clock Speed              98.8 MHz       96.9 MHz       95.0 MHz
Average FMR              41.1           7.49           15.6
Simulation Rate          2.4 MHz        14 MHz         6 MHz
Average Simulator IPS    2.4 MIPS       5.1 MIPS       4.7 MIPS
(Xilinx Virtex-II Pro 70, synthesized with ISE 8.1i)
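The derived rows follow from the measured ones: simulation rate is essentially the FPGA clock divided by FMR, and simulator IPS is the simulation rate divided by the target's CPI. A quick sanity check in C++ using the unpipelined column (whose target completes one instruction per model cycle, so its CPI is 1):

    #include <cstdio>

    int main() {
        // Unpipelined column from the table above.
        const double clock_mhz = 98.8;   // FPGA clock speed
        const double fmr       = 41.1;   // FPGA cycles per model cycle
        const double cpi       = 1.0;    // unpipelined target: one instruction per model cycle

        const double sim_rate_mhz = clock_mhz / fmr;    // model cycles per second
        const double sim_mips     = sim_rate_mhz / cpi; // simulated instructions per second

        // Prints roughly 2.4 MHz and 2.4 MIPS, matching the table.
        std::printf("%.1f MHz, %.1f MIPS\n", sim_rate_mhz, sim_mips);
        return 0;
    }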
Future Work:
Simulating Multicores
Scheme 1: Duplicate both partitions
[Figure: Timing Models A, B, C, and D each drive their own functional register state and datapath ("Func Reg + Datapath"), all sharing a single Functional Memory State; the timing/functional interaction occurs between each timing model and its own datapath]
Scheme 2: Cluster Timing Partitions
Use a context ID to reference all state lookups
[Figure: Timing Models A, B, C, and D share a single Functional Reg State + Datapath and a single Functional Memory State; the timing/functional interaction still occurs, now between the timing models and the shared functional state]
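A hedged C++ sketch of what that context ID means for the shared register state (names and sizes are illustrative, not HAsim's actual structures):

    #include <array>
    #include <cstdint>

    constexpr size_t kContexts = 4;    // assumption: one context per timing model A-D
    constexpr size_t kArchRegs = 32;

    // Shared functional register state: every lookup carries a context ID, so the
    // timing models can share one physical copy of the register state and datapath.
    struct SharedRegState {
        std::array<std::array<uint64_t, kArchRegs>, kContexts> regs{};

        uint64_t read(uint8_t ctx, uint8_t reg) const        { return regs[ctx][reg]; }
        void     write(uint8_t ctx, uint8_t reg, uint64_t v) { regs[ctx][reg] = v; }
    };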
Future Work: Simulating Multicores
Scheme 3: Multiplex the timing models themselves
• Leverage HAsim A-Ports in the Timing Model
• Out of scope of today's talk
Use a context ID to reference all state lookups
[Figure: Timing Models A, B, C, and D are themselves multiplexed onto shared hardware and drive a single Functional Reg State + Datapath and Functional Memory State; the timing/functional interaction still occurs between the multiplexed timing models and the shared functional state]
Future Work:
Unifying with the UT-FAST model
UT-FAST is Functional-First
[Figure: a functional emulator running in software (the Func Partition) feeds an execution stream to a Timing Partition on the FPGA, which can resteer the emulator]
This can be unified into Timing-Directed
Just do "execute-at-fetch"
[Figure: the same software functional emulator, now directed by the timing partition, executes each instruction as it is fetched, so the ahead-of-time execution stream and the resteer path go away]
Summary
Described a scheme for closely-coupled timing-directed partitioning
• Both partitions are suitable for on-FPGA implementation
Demonstrated such a scheme's benefits:
• Very good reuse; very good area and clock speed
• Good FPGA-to-Model Cycle Ratio (FMR)
• Caveat: assumes some correspondence between the timing model and the functional partition (recall the unpipelined target)
We plan to extend this using contexts for hardware multiplexing [Chung 07]
Future: rare complex operations (such as syscalls) could be done in software using virtual channels
Questions?
[email protected]
Extra Slides
[email protected]
Functional Partition Fetch
Functional Partition Decode
Functional Partition Execute
Functional Partition Back End
Timing Model: Unpipelined
5-Stage Pipeline Timing Model
Out-Of-Order Superscalar Timing Model