FPGA-based Fast, Cycle-Accurate Full System Simulators Derek Chiou, Huzefa Sanjeliwala, Dam Sunwoo, John Xu and Nikhil Patil University of Texas at Austin.



Wouldn’t it be nice to have a simulator that is

- Fast: 10M cycles per second, fast enough to run real datasets to completion
- Accurate: produce cycle-accurate numbers
- Complete: run real operating systems, applications
- Transparent: can see everything in processor, no performance hit
- Inexpensive: need thousands
- Usable: quick changes, easy to see performance

Software?

- Software-based simulators inherently cannot achieve this speed and be cycle-accurate at the same time
  - A 128-entry, fully-associative TLB at the limit requires 128 load, compare operations
  - Arbitration requires first looking across multiple bidders
- There are lots of these structures in a complex processor!
  - Thousands to tens of thousands of events
  - Even with perfect parallelism, need a lot of CPUs
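To make the TLB point concrete, here is a minimal Python sketch of what software has to do for one lookup. Hardware compares all tags in the same cycle; software walks the entries one by one. The names `TlbEntry` and `tlb_lookup` and the page numbers are illustrative assumptions, not part of any simulator named in these slides.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    valid: bool
    vpn: int  # virtual page number (the tag)
    ppn: int  # physical page number


def tlb_lookup(entries, vpn):
    """Return (ppn or None, number of tag comparisons performed).

    Every entry is checked, mirroring a fully-associative lookup that
    compares all tags; hardware does this in parallel, software cannot.
    """
    compares = 0
    hit = None
    for entry in entries:
        compares += 1  # one load + compare per entry
        if entry.valid and entry.vpn == vpn:
            hit = entry.ppn
    return hit, compares


# Hypothetical 128-entry TLB with made-up mappings.
entries = [TlbEntry(True, vpn, vpn + 0x1000) for vpn in range(128)]
ppn, compares = tlb_lookup(entries, 127)
print(ppn, compares)  # prints: 4223 128
```

One simulated TLB access costs 128 sequential compare steps here, and a complex processor has many such structures active every cycle.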

Hardware

- Clearly, hardware is necessary
- Reconfigurability (read FPGAs) is required for flexibility
- But how?

Full Implementation?

- Take RTL code, compile for FPGA
- Implementing full system in FPGA is prohibitively large
  - Shih Lin Lu’s group has single original Pentium (586, 3.1M transistors) in largest Xilinx FPGA
  - Emulate Pentium M in a single FPGA? 140M transistors
- Instead, what about
  - Accurately (to cycle resolution) simulate its behavior
  - Running real, unmodified applications, OS
  - With full visibility at full speed?

If execution speeds are reasonable, do I care?


Can I Partition the Problem?

- 64b adder way too big to be implemented as a single monolithic entity
- But, I can implement 64 1b adders very easily with very little state and complexity
- Partitioning is good if possible
- But, how to partition?
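The adder analogy can be sketched in a few lines of Python: a 64-bit add built from 64 identical 1-bit full adders chained through their carries. This is only an illustration of the partitioning idea; the function names are assumptions made for this sketch.

```python
def full_adder(a, b, cin):
    """1-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(x, y, width=64):
    """Add two unsigned integers using `width` chained 1-bit full adders."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry  # carry is the overflow out of the top bit


print(ripple_add(5, 7))          # prints: (12, 0)
print(ripple_add(2**64 - 1, 1))  # prints: (0, 1)
```

Each 1-bit slice holds almost no state and is trivial to build; the complexity lives in how the slices are connected.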

Classic Partitioning

- On module boundary
  - Caches, memories, ALUs, processors, memory controllers
- Partitioning doesn’t save state or complexity, but enables design to be partitioned over multiple FPGAs and software
- Problems?

[Figure: pipelined processor datapath. Labels: PC, instruction cache/memory, IR pipeline registers, GPR file, immediate extend, ALU, bypass, data cache/memory, MD1, MD2]

Functional/Timing Partition

- Functional model simulates ISA
- Timing model simulates micro-architecture
- Asim and SimpleScalar are written like this
  - Software
  - One processor
  - Lots of interaction between functional and timing
    - Intended to avoid rollback of any component
- Put timing model in FPGA???
  - Parallel component executed in hardware!

UT FAST Partitioning

- On ISA/micro-architecture boundary (ISA + FPGA)
  - Instruction trace generated by ISA simulator (e.g., Bochs, Simics)
    - Fast, full system but no timing information (could be hardware!!!)
- What do we need to simulate in the timing model?
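A toy Python sketch of this split, under stated assumptions: the three-instruction ISA, the `TraceEntry` fields, and the one-cycle load-use stall are all invented for illustration and are not the FAST trace format. The functional model computes values and emits a trace; the timing model reads only the trace and never touches data.

```python
from collections import namedtuple

# Hypothetical trace record: what the timing model is allowed to see.
TraceEntry = namedtuple("TraceEntry", "pc op dest srcs")


def functional_model(program, regs, mem):
    """Execute the ISA and yield a trace; has no notion of cycles."""
    for pc, (op, dest, a, b) in enumerate(program):
        if op == "li":            # load immediate: regs[dest] = a
            regs[dest] = a
            srcs = ()
        elif op == "add":         # regs[dest] = regs[a] + regs[b]
            regs[dest] = regs[a] + regs[b]
            srcs = (a, b)
        elif op == "load":        # regs[dest] = mem[regs[a]]
            regs[dest] = mem.get(regs[a], 0)
            srcs = (a,)
        else:
            raise ValueError(f"unknown op {op!r}")
        yield TraceEntry(pc, op, dest, srcs)


def timing_model(trace, load_use_stall=1):
    """Count cycles for an in-order pipeline using only the trace."""
    cycles = 0
    prev = None
    for entry in trace:
        cycles += 1
        # Stall if this instruction reads the result of the preceding load.
        if prev is not None and prev.op == "load" and prev.dest in entry.srcs:
            cycles += load_use_stall
        prev = entry
    return cycles


program = [
    ("li", 1, 100, None),
    ("load", 2, 1, None),
    ("add", 3, 2, 2),   # uses r2 right after the load: one stall
    ("add", 4, 3, 1),
]
regs = [0] * 8
mem = {100: 21}
cycles = timing_model(functional_model(program, regs, mem))
print(cycles, regs[3], regs[4])  # prints: 5 42 142
```

The timing model needs register names and opcodes but no register values, which is what lets it stay small.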

[Figure: the same pipelined datapath, now fed by an instruction trace. Labels: Trace, PC, instruction memory, IR pipeline registers, GPR file, immediate extend, ALU, bypass, data memory, MD1, MD2]

UT FAST Complex Processors

- Straight pipelines are easy
- What about caches/TLBs?
  - Keep tags, pass address (virtual and physical if necessary)
  - Hits, misses determined but don’t need data
- Superscalar (multiple issue)?
  - “Fetch and issue” multiple instructions assuming they meet boundary constraints
  - Multiple “functional units”
  - Reservation stations
  - Reorder buffer
  - Pipeline control along with instructions
- NO DATAPATH!!!
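The "keep tags, don't need data" idea can be sketched as a tag-only cache model in Python. The class name, the default geometry, and the LRU replacement policy are assumptions chosen for this sketch; the slides do not specify them.

```python
class TagOnlyCache:
    """Set-associative cache model that stores tags only, never data."""

    def __init__(self, num_sets=64, ways=2, line_bytes=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One list of tags per set, ordered least to most recently used.
        self.sets = [[] for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Record an access to `addr`; return True on hit, False on miss."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)      # move to most-recently-used position
            tags.append(tag)
            self.hits += 1
            return True
        if len(tags) == self.ways:
            tags.pop(0)           # evict least recently used (assumed policy)
        tags.append(tag)
        self.misses += 1
        return False


cache = TagOnlyCache(num_sets=2, ways=1, line_bytes=16)
results = [cache.access(a) for a in (0, 4, 32, 0)]
print(results, cache.hits, cache.misses)  # prints: [False, True, False, False] 1 3
```

Hit and miss timing falls out of the tags and addresses alone; the data itself stays in the functional model.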

Timing Model speed almost unimportant!

- Multi-cycle memories to create more ports

[Figure: timing model block diagram. Labels: instruction stream, I-Fetch delay, I-Decode delay, GPR rename delay, GPR read delay, FPR rename delay, FPR read delay, ALUs, FPUs, Ldst units, Br, reorder buffer, I-Cache, D-Cache, BIU, memory controller, memory, disk, network]
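One way to read "multi-cycle memories to create more ports" is time-multiplexing: a single-ported host memory serves several target ports by spending several host cycles per simulated (target) cycle. The Python sketch below assumes that reading, plus a fixed schedule of one host cycle per target port and read-before-write ordering within a target cycle; all names are hypothetical.

```python
class MultiCycleMemory:
    """Emulate an N-port target memory with a single-ported host memory."""

    def __init__(self, size, target_ports):
        self.data = [0] * size
        self.target_ports = target_ports
        self.host_cycles = 0
        self.target_cycles = 0

    def target_cycle(self, requests):
        """Serve one target cycle.

        `requests` holds up to `target_ports` tuples, either ("r", addr)
        or ("w", addr, value). Returns one result per request: the value
        read, or None for a write.
        """
        if len(requests) > self.target_ports:
            raise ValueError("more requests than target ports")
        # All reads observe the state from before this target cycle.
        results = [self.data[r[1]] if r[0] == "r" else None for r in requests]
        for r in requests:
            if r[0] == "w":
                self.data[r[1]] = r[2]
        self.host_cycles += self.target_ports  # fixed slot per port (assumed)
        self.target_cycles += 1
        return results


m = MultiCycleMemory(size=16, target_ports=3)
m.target_cycle([("w", 2, 7)])
out = m.target_cycle([("r", 2), ("r", 2), ("w", 2, 9)])
print(out, m.data[2], m.host_cycles, m.target_cycles)  # prints: [7, 7, None] 9 6 2
```

The host runs three cycles for each target cycle here, which is acceptable when timing model speed is not the bottleneck.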

Example of Complication: Branch Prediction

- Must process mis-speculated instructions in timing model
  - Implement BP in timing model
  - Timing model forces ISA simulator to mis-speculate
    - Rollback, restore
  - Requires support from ISA simulator
- Branch predictor predictor in ISA simulator?
  - BP only works in processor if it’s fairly accurate
- FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path
  - Most complexity (BP, parallelism) can be handled this way
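A rough Python sketch of why this works: the timing model owns the branch predictor and only needs to involve the ISA simulator when a prediction is wrong. The 2-bit saturating counter, the 3-cycle penalty, and the trace of (pc, taken) pairs are assumptions for illustration; the actual rollback in the ISA simulator is reduced here to a counter.

```python
class TwoBitPredictor:
    """Per-PC 2-bit saturating counters; predict taken when counter >= 2."""

    def __init__(self):
        self.counters = {}  # pc -> 0..3, default 1 (weakly not taken)

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)


def run_timing_model(branch_trace, mispredict_penalty=3):
    """Return (cycles, rollbacks) for a trace of (pc, taken) branch outcomes.

    Each misprediction stands for one request to the ISA simulator to run
    the wrong path and then roll back; that mechanism is not modeled here.
    """
    bp = TwoBitPredictor()
    cycles = 0
    rollbacks = 0
    for pc, taken in branch_trace:
        cycles += 1
        if bp.predict(pc) != taken:
            rollbacks += 1
            cycles += mispredict_penalty
        bp.update(pc, taken)
    return cycles, rollbacks


# Hypothetical loop branch: taken 8 times, then falls through once.
trace = [(0x40, True)] * 8 + [(0x40, False)]
print(run_timing_model(trace))  # prints: (15, 2)
```

Only 2 of the 9 branches need a rollback in this example; the rest of the time the functional model simply runs ahead on the right path.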

Status & Conclusions

- 1MHz to 100MHz, cycle-accurate, full-system, multiprocessor simulator
  - Well, not quite that fast right now, but we are using embedded 300MHz PowerPC 405 to simplify
- X86, boots Linux, Windows, targeting 80486 to Pentium D-like and beyond (Dam Sunwoo, Nikhil Patil)
  - Bochs functional model (looking at much faster models)
  - Heavily modified instruction trace and rollback
- Branch-predicted superscalar model almost done in Bluespec and Verilog (John Xu, Huzefa Sanjeliwala)
  - Have straight pipeline 486 model with TLBs and caches
- Statistics gathered in hardware
  - Very little if any probe effect
- Tools to semi-automate micro-architectural and ISA level exploration
  - Orthogonality of models makes both simpler