Transcript Slide 1
Closely-Coupled Timing-Directed Partitioning in HAsim

Michael Pellauer† [email protected]
Murali Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡
†MIT CS and AI Lab, Computation Structures Group
‡Intel Corporation, VSSAD Group
To Appear In: ISPASS 2008

Motivation

We want to simulate target platforms quickly. We also want to construct simulators quickly. Partitioned simulators are a known technique from traditional performance models.

[Diagram: a Timing Partition and a Functional Partition connected by an interaction interface. Responsibilities listed: micro-architecture, resource contention, dependencies, ISA, off-chip communication.]

• Simplifies the timing model
• Amortizes functional model design effort over many models
• The Functional Partition can be extremely FPGA-optimized

Different Partitioning Schemes

As categorized by Mauer, Hill and Wood. [Figure: the partitioning scheme categories; source: [MAUER 2002], ACM SIGMETRICS.] We believe that a timing-directed solution, with both partitions upon the FPGA, will ultimately lead to the best performance.

Functional Partition in Software: Asim

The functional partition supports these operations:
• Get Instruction (at a given Address)
• Get Dependencies
• Get Instruction Results
• Read Memory*
• Speculatively Write Memory* (locally visible)
• Commit or Abort instruction
• Write Memory* (globally visible)
* Optional depending on instruction type

Execution in Phases

[Diagram: several instruction streams progressing through the operation phases above (fetch, decode, execute, memory read/write, commit/abort) in different interleavings.]

The Emer Assertion: All data dependencies can be represented via these phases.

Detailed Example: 3 Different Timing Models

Executing the same instruction sequence. [Diagram: the same instruction sequence as executed by three different timing models.]

Functional Partition in Hardware?

Requirements:
• Support these operations in hardware
• Allow for out-of-order execution, speculation, rollback
Challenges:
• Minimize operation execution times
• Pipeline wherever possible
• Tradeoff between BRAM/multiport RAMs
• Race conditions due to extreme parallelism

Functional Partition As Pipeline

[Diagram: a Timing Model with stages Token Gen, Fet, Dec, Exe, Mem, LCom, GCom driving a Functional Partition built over Register State (RegFile) and Memory State.]

Conveys the concept well, but poor performance.

Implementation: Large Scoreboards in BRAM

• Series of tables in BRAM
• Store information about each in-flight instruction
• Tables are indexed by “token”
• Tokens are also used by the timing partition to refer to each instruction
• New operation “getToken” allocates a space in the tables

Implementing the Operations

See paper for details (also extra slides).
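To make the token-indexed operation interface above concrete, here is a minimal behavioral sketch in Python of how a timing model might drive such a functional partition. It is only an illustration under assumed names (FunctionalPartition, get_token, the dictionary "scoreboard", and the unpipelined_model_cycle driver are all hypothetical); HAsim itself keeps these tables in FPGA Block RAM and pipelines the operations.

```python
# Hedged sketch: a software stand-in for the token-indexed functional partition.
# All names are hypothetical; the real design lives in hardware.

class FunctionalPartition:
    def __init__(self, program, memory):
        self.program = program      # address -> decoded instruction (a dict)
        self.mem = memory           # globally visible memory state
        self.regs = {}              # architectural register file
        self.table = {}             # scoreboard: token -> per-instruction entry
        self.next_token = 0

    def get_token(self):
        """Allocate a scoreboard entry for a new in-flight instruction."""
        tok = self.next_token
        self.next_token += 1
        self.table[tok] = {}
        return tok

    def get_instruction(self, tok, addr):
        """Fetch the instruction at the given address."""
        self.table[tok]["inst"] = self.program[addr]
        return self.table[tok]["inst"]

    def get_dependencies(self, tok):
        """Report source and destination registers to the timing model."""
        inst = self.table[tok]["inst"]
        return inst.get("srcs", []), inst.get("dst")

    def get_results(self, tok):
        """Execute: compute a result from the source registers (locally visible)."""
        inst = self.table[tok]["inst"]
        srcs = [self.regs.get(r, 0) for r in inst.get("srcs", [])]
        self.table[tok]["result"] = inst["op"](*srcs) if "op" in inst else None
        return self.table[tok]["result"]

    def read_memory(self, tok):
        """Optional: perform the load for a load instruction."""
        inst = self.table[tok]["inst"]
        if "load_addr" in inst:
            self.table[tok]["result"] = self.mem.get(inst["load_addr"], 0)

    def spec_write_memory(self, tok):
        """Optional: buffer a store speculatively (locally visible only)."""
        inst = self.table[tok]["inst"]
        if "store_addr" in inst:
            self.table[tok]["store"] = (inst["store_addr"], self.table[tok]["result"])

    def commit(self, tok):
        """Commit: update the register file and make buffered stores globally visible."""
        entry = self.table.pop(tok)
        if entry["inst"].get("dst") is not None:
            self.regs[entry["inst"]["dst"]] = entry["result"]
        if "store" in entry:
            addr, val = entry["store"]
            self.mem[addr] = val

    def abort(self, tok):
        """Abort a mis-speculated instruction: discard its scoreboard entry."""
        self.table.pop(tok, None)


# An unpipelined timing model simply walks every phase for one instruction per model cycle:
def unpipelined_model_cycle(fp, pc):
    tok = fp.get_token()
    fp.get_instruction(tok, pc)
    fp.get_dependencies(tok)
    fp.get_results(tok)
    fp.read_memory(tok)
    fp.spec_write_memory(tok)
    fp.commit(tok)
    return pc + 1   # next model cycle fetches the next address
```

A pipelined or out-of-order timing model would instead interleave these calls for many in-flight tokens at once, which is what the BRAM scoreboards are intended to support.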
Assessment: Three Timing Models

• Unpipelined Target
• 5-Stage Pipeline
• MIPS R10K-like out-of-order superscalar

Assessment: Target Performance

[Chart: Target Processor CPI. Y-axis: Model Cycles per Instruction (CPI), 0 to 3.5. Series: Unpipelined, 5-stage, Out-of-Order. Benchmarks: median, multiply, qsort, towers, vvadd, average.]

Targets have an idealized memory hierarchy.

Assessment: Simulator Performance

[Chart: Simulation Rate. Y-axis: FPGA-Cycles per Model Cycle (FMR), 0 to 45. Series: Unpipelined, 5-Stage, Out-of-Order. Benchmarks: median, multiply, qsort, towers, vvadd, average.]

Some correspondence between the target and the functional partition is very helpful.

Assessment: Reuse and Physical Stats

[Table: where functionality is implemented — IMem, Program Counter, Branch Predictor, Scoreboard/ROB, Reg File, Maptable/Freelist, ALU, DMem, Store Buffer, Snapshots/Rollback — either in the Functional Partition or in the timing model, for the Unpipelined, 5-Stage, and Out-of-Order designs; several entries are N/A for the simpler targets.]

FPGA usage (Virtex-II Pro 70, using ISE 8.1i):

                          Unpipelined    5-stage        Out-of-Order
  FPGA Slices             6599 (20%)     9220 (28%)     22,873 (69%)
  Block RAMs              18 (5%)        25 (7%)        25 (7%)
  Clock Speed             98.8 MHz       96.9 MHz       95.0 MHz
  Average FMR             41.1           7.49           15.6
  Simulation Rate         2.4 MHz        14 MHz         6 MHz
  Average Simulator IPS   2.4 MIPS       5.1 MIPS       4.7 MIPS

Future Work: Simulating Multicores

Scheme 1: Duplicate both partitions. [Diagram: Timing Models A, B, C, and D, each with its own Func Reg + Datapath, all sharing one Functional Memory State; interaction occurs here.]

Scheme 2: Cluster Timing Partitions. Use a context ID to reference all state lookups. [Diagram: Timing Models A, B, C, and D sharing a single Functional Reg State + Datapath and a single Functional Memory State; interaction still occurs here.]

Scheme 3: Perform multiplexing of the timing models themselves. Leverage HAsim A-Ports in the Timing Model (out of scope of today's talk). Use a context ID to reference all state lookups. [Diagram: Timing Models A, B, C, and D multiplexed over one shared Functional Reg State + Datapath and Functional Memory State; interaction still occurs here.] A software sketch of the context-ID indexing idea appears at the end of this transcript.

Future Work: Unifying with the UT-FAST Model

UT-FAST is Functional-First: a functional emulator running in software (the Func Partition) feeds an execution stream to the Timing Partition on the FPGA, with a resteer path back to the emulator. This can be unified into Timing-Directed: just do “execute-at-fetch.” [Diagram: the functional emulator running in software exchanging an execution stream and resteer with the FPGA.]

Summary

Described a scheme for closely-coupled timing-directed partitioning. Both partitions are suitable for on-FPGA implementation. Demonstrated such a scheme's benefits:
• Very good reuse
• Very good area/clock speed
• Good FPGA-to-Model Cycle Ratio
Caveat: assuming some correspondence between the timing model and the functional partition (recall the unpipelined target). We plan to extend this using contexts for hardware multiplexing [Chung 07]. Future: rare complex operations (such as syscalls) could be done in software using virtual channels.

Questions?

[email protected]

Extra Slides

[email protected]
• Functional Partition: Fetch
• Functional Partition: Decode
• Functional Partition: Execute
• Functional Partition: Back End
• Timing Model: Unpipelined
• 5-Stage Pipeline Timing Model
• Out-Of-Order Superscalar Timing Model
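As a rough software illustration of the context-ID idea from multicore Schemes 2 and 3 above, the sketch below (same hypothetical Python style as the earlier functional-partition sketch) shows several timing-model instances sharing one functional state, with every register lookup tagged by a context ID and the memory state shared by all cores. The class and method names are assumptions for illustration only; HAsim's actual sharing would be implemented in hardware.

```python
# Hedged sketch: one shared functional state, selected by context ID,
# serving several timing-model instances (Scheme 2 / Scheme 3 style sharing).
# Names are hypothetical; this is not the HAsim implementation.

class SharedFunctionalState:
    def __init__(self, num_contexts):
        # Register state is kept per simulated core but lives in one shared
        # structure, indexed by context ID; memory state is shared by all cores.
        self.regs = [dict() for _ in range(num_contexts)]
        self.mem = {}

    def read_reg(self, ctx, reg):
        return self.regs[ctx].get(reg, 0)

    def write_reg(self, ctx, reg, val):
        self.regs[ctx][reg] = val

    def read_mem(self, addr):
        return self.mem.get(addr, 0)

    def write_mem(self, addr, val):
        self.mem[addr] = val


class TimingModelInstance:
    """One timing model (one simulated core) that tags every functional
    state lookup with its own context ID."""
    def __init__(self, ctx, shared_state):
        self.ctx = ctx
        self.state = shared_state

    def model_cycle(self):
        # Every register access carries self.ctx, so one shared structure
        # can serve all cores without duplicating the datapath.
        x = self.state.read_reg(self.ctx, "r1")
        self.state.write_reg(self.ctx, "r2", x + 1)


# Four timing models (cores A-D) sharing one functional state:
shared = SharedFunctionalState(num_contexts=4)
cores = [TimingModelInstance(ctx, shared) for ctx in range(4)]
for core in cores:
    core.model_cycle()
```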