FPGA-based Fast, Cycle-Accurate Full System Simulators Derek Chiou, Huzefa Sanjeliwala, Dam Sunwoo, John Xu and Nikhil Patil University of Texas at Austin.
Wouldn’t it be nice to have a simulator that is
Fast: 10M cycles per second, fast enough to run real datasets to completion
Accurate: produce cycle-accurate numbers
Complete: run real operating systems, applications
Transparent: can see everything in processor, no performance hit
Inexpensive: need thousands
Usable: quick changes, easy to see performance
Software?
Software-based simulators inherently cannot achieve this speed and be cycle-accurate at the same time
  A 128-entry, fully-associative TLB at the limit requires 128 load and compare operations
  Arbitration requires first looking across multiple bidders
  There are lots of these structures in a complex processor!
    Thousands to tens of thousands of events
  Even with perfect parallelism, need a lot of CPUs
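To make the TLB cost concrete, here is a minimal Python sketch of a fully-associative lookup done the way software has to do it, one entry at a time. The `TlbEntry` layout, the `tlb_lookup` name, and the sample mappings are illustrative assumptions, not taken from the talk; the compare counter only shows why a worst-case hit or a miss touches all 128 entries, whereas hardware performs the compares in parallel.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    valid: bool
    vpn: int  # virtual page number (the tag being matched)
    ppn: int  # physical page number


def tlb_lookup(entries, vpn):
    """Return (ppn or None, number of tag compares performed)."""
    compares = 0
    for entry in entries:
        compares += 1  # software pays one load + compare per entry
        if entry.valid and entry.vpn == vpn:
            return entry.ppn, compares
    return None, compares


# Hypothetical 128-entry TLB mapping vpn i -> ppn i + 1000.
tlb = [TlbEntry(True, i, i + 1000) for i in range(128)]

hit_ppn, hit_cost = tlb_lookup(tlb, 127)      # worst-case hit: last entry
miss_ppn, miss_cost = tlb_lookup(tlb, 9999)   # miss: every entry examined
```

Multiply that per-lookup cost by every associative structure and arbiter touched in one simulated cycle and the event count per cycle grows quickly.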
Hardware
Clearly, hardware is necessary
Reconfigurability (read: FPGAs) is required for flexibility
But how?
Full Implementation?
Take RTL code, compile for FPGA
Implementing full system in FPGA is prohibitively large
  Shih-Lien Lu’s group has single original Pentium (586, 3.1M transistors) in largest Xilinx FPGA
  Emulate Pentium M in a single FPGA? 140M transistors
Instead, what about
  Accurately (to cycle resolution) simulate its behavior
  Running real, unmodified applications, OS
  With full visibility at full speed?
If execution speeds are reasonable, do I care?
Can I Partition the Problem?
64b adder way too big to be implemented as a single monolithic entity
But, I can implement 64 1b adders very easily with very little state and complexity
Partitioning is good if possible
But, how to partition?
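The adder analogy can be sketched directly: a 64-bit add built from 64 identical 1-bit full adders chained through the carry. This is a plain ripple-carry illustration in Python; the function names `full_adder` and `add64` are made up for the example and nothing here is taken from the FAST implementation.

```python
def full_adder(a, b, cin):
    """1-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def add64(x, y):
    """Add two 64-bit values using 64 chained 1-bit adders."""
    carry = 0
    result = 0
    for i in range(64):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry  # carry is the carry out of bit 63


small_sum = add64(5, 7)
wrapped = add64(2**64 - 1, 1)
```

Each 1-bit slice holds almost no state and has trivial logic; the only coupling between slices is the single carry wire, which is what makes this partition cheap.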
Classic Partitioning
On module boundary
  Caches, memories, ALUs, processors, memory controllers
Partitioning doesn’t save state or complexity, but enables design to be partitioned over multiple FPGAs and software
Problems?
[Figure: pipelined datapath with PC, instruction cache/memory, IR stages, GPR file, immediate extend, ALU with bypass, and data cache/memory]
Functional/Timing Partition
Functional model simulates ISA
Timing model simulates micro-architecture
Asim and SimpleScalar are written like this
  Software
  One processor
  Lots of interaction between functional and timing
  Intended to avoid rollback of any component
Put timing model in FPGA???
  Parallel component executed in hardware!
UT FAST Partitioning
On ISA/micro-architecture boundary (ISA + FPGA)
Instruction trace generated by ISA simulator (e.g., Bochs, Simics)
  Fast, full system but no timing information (could be hardware!!!)
What do we need to simulate in the timing model?
[Figure: the same pipelined datapath, now driven by an instruction trace, with instruction memory, IR stages, GPR file, immediate extend, ALU with bypass, and data memory]
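A rough sketch of the ISA/micro-architecture split, in Python: a functional model executes instructions and emits a trace with no timing, and a separate timing model counts cycles from the trace alone. The three-instruction toy ISA, the `TraceEntry` fields, the 5-stage depth, and the one-cycle load-use penalty are all assumptions chosen for illustration; a real trace from Bochs or Simics would carry far more information.

```python
from collections import namedtuple

# Hypothetical trace record: no data values, only what timing needs.
TraceEntry = namedtuple("TraceEntry", "pc op dst srcs")


def functional_model(program, regs):
    """Execute a toy program (computing real values) and yield a trace."""
    pc = 0
    while pc < len(program):
        op, dst, a, b = program[pc]
        if op == "addi":
            regs[dst] = regs[a] + b
            srcs = (a,)
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
            srcs = (a, b)
        elif op == "load":  # toy load: the loaded value is just the address
            regs[dst] = regs[a] + b
            srcs = (a,)
        else:
            raise ValueError(f"unknown op {op}")
        yield TraceEntry(pc, op, dst, srcs)
        pc += 1


def timing_model(trace, pipeline_depth=5, load_use_penalty=1):
    """Count cycles for an in-order pipeline using only the trace."""
    cycles = pipeline_depth - 1  # cycles to fill the pipeline
    prev = None
    for entry in trace:
        cycles += 1
        # Stall when an instruction reads the result of the preceding load.
        if prev is not None and prev.op == "load" and prev.dst in entry.srcs:
            cycles += load_use_penalty
        prev = entry
    return cycles


program = [
    ("addi", 1, 0, 4),   # r1 = r0 + 4
    ("load", 2, 1, 0),   # r2 = mem[r1 + 0] (toy semantics)
    ("add", 3, 2, 1),    # r3 = r2 + r1, depends on the load
    ("add", 4, 3, 3),    # r4 = r3 + r3
]
regs = [0] * 8
cycles = timing_model(functional_model(program, regs))
```

The timing model never looks at register values, only at opcodes and register names, which is the property that lets it shrink to something an FPGA can hold.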
UT FAST Complex Processors
Straight pipelines are easy; what about
Caches/TLBs?
  Keep tags, pass address (virtual and physical if necessary)
  Hits, misses determined but don’t need data
Superscalar (multiple issue)?
  “Fetch and issue” multiple instructions assuming they meet boundary constraints
  Multiple “functional units”
  Reservation stations
  Reorder buffer
Pipeline control along with instructions
NO DATAPATH!!!
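The "keep tags, don’t need data" idea can be sketched as a cache model that stores only tags and reports hit or miss per address. This is a Python illustration under assumed parameters (set count, associativity, line size, LRU replacement); the class name `TagOnlyCache` is hypothetical and the talk does not specify a replacement policy.

```python
class TagOnlyCache:
    """Set-associative cache model that tracks tags only, never data."""

    def __init__(self, num_sets=64, ways=2, line_bytes=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One list of tags per set, least recently used first.
        self.sets = [[] for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit, False on a miss; update LRU order."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)
            tags.append(tag)  # move to most recently used position
            self.hits += 1
            return True
        if len(tags) == self.ways:
            tags.pop(0)  # evict the least recently used tag
        tags.append(tag)
        self.misses += 1
        return False


cache = TagOnlyCache(num_sets=2, ways=2, line_bytes=32)
results = [cache.access(a) for a in (0, 4, 64, 128, 0)]
```

Because the functional model already produced the correct data, the timing side only needs the hit/miss decision to charge the right latency.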
Timing Model speed almost unimportant!
Multi-cycle memories to create more ports
[Figure: timing model block diagram: instruction stream feeding I-Fetch and I-Decode delays, GPR/FPR rename and read delays, ALU, Br, Ldst, and FPU units, reorder buffer, I-Cache, D-Cache, BIU, memory controller, memory, disk, network]
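One reading of "multi-cycle memories to create more ports" is time-multiplexing: a single-ported host memory is accessed several times per simulated (target) cycle so that it behaves like a multi-ported target memory. The Python sketch below illustrates that reading only; the class name, the read-before-write ordering, and the host-cycle accounting are assumptions, not details from the talk.

```python
class MultiCycleMemory:
    """Emulate a multi-ported target memory with one host port."""

    def __init__(self, size):
        self.data = [0] * size
        self.host_cycles = 0  # host cycles spent so far

    def _host_read(self, addr):
        self.host_cycles += 1  # one host cycle per port access
        return self.data[addr]

    def _host_write(self, addr, value):
        self.host_cycles += 1
        self.data[addr] = value

    def target_cycle(self, reads, writes):
        """Serve all ports for one target cycle; reads see pre-write state."""
        values = [self._host_read(a) for a in reads]
        for addr, value in writes:
            self._host_write(addr, value)
        return values


mem = MultiCycleMemory(16)
mem.target_cycle([], [(1, 10), (2, 20)])            # 2 write ports
values = mem.target_cycle([1, 2, 1], [(1, 99)])     # 3 read ports + 1 write port
```

The trade is host cycles for ports: the second target cycle above costs four host cycles, which is acceptable when the timing model is not the bottleneck.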
Example of Complication: Branch Prediction
Must process mis-speculated instructions in timing model
Implement BP in timing model
  Timing model forces ISA simulator to mis-speculate
  Rollback, restore
  Requires support from ISA simulator
Branch predictor predictor in ISA simulator?
  BP only works in processor if it’s fairly accurate
FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path
Most complexity (BP, parallelism) can be handled this way
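The mis-speculate-then-rollback flow can be sketched in Python with a toy functional model that supports checkpoint and rollback, and a timing-side 2-bit predictor that forces it down the wrong path only when the prediction is wrong. Everything here is hypothetical: the `ToyFunctionalModel` class, its single loop-closing branch, and the `force_taken` hook stand in for whatever support a real ISA simulator would provide.

```python
import copy


class ToyFunctionalModel:
    """Hypothetical ISA simulator running one counting loop."""

    def __init__(self, iterations):
        self.state = {"pc": 0, "i": 0, "n": iterations}

    def checkpoint(self):
        return copy.deepcopy(self.state)

    def rollback(self, saved):
        self.state = copy.deepcopy(saved)

    def step(self, force_taken=None):
        """Execute the loop-closing branch; return the direction followed."""
        s = self.state
        s["i"] += 1
        actual = s["i"] < s["n"]
        taken = actual if force_taken is None else force_taken
        s["pc"] = 0 if taken else 1  # pc 0 = loop body, pc 1 = exit
        return taken


def run(iterations):
    fm = ToyFunctionalModel(iterations)
    counter = 2  # 2-bit saturating counter, starts weakly taken
    mispredicts = wrong_path_steps = 0
    while fm.state["pc"] == 0:
        predicted = counter >= 2
        saved = fm.checkpoint()
        actual = fm.step()  # common case: already on the right path
        if predicted != actual:
            fm.rollback(saved)
            fm.step(force_taken=predicted)  # forced mis-speculation
            wrong_path_steps += 1           # would be fed to the timing model
            fm.rollback(saved)
            fm.step()                       # restore and resume on right path
            mispredicts += 1
        counter = min(3, counter + 1) if actual else max(0, counter - 1)
    return mispredicts, wrong_path_steps, fm.state["i"]


outcome = run(10)
```

For a ten-iteration loop the predictor is wrong only at the loop exit, so the rollback path runs once while the other nine branches proceed without any extra work.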
Status & Conclusions
1MHz to 100MHz, cycle-accurate, full-system, multiprocessor simulator
  Well, not quite that fast right now, but we are using embedded 300MHz PowerPC 405 to simplify
X86, boots Linux, Windows, targeting 80486 to Pentium D-like and beyond (Dam Sunwoo, Nikhil Patil)
  Bochs functional model (looking at much faster models)
  Heavily modified instruction trace and rollback
Branch-predicted superscalar model almost done in Bluespec and Verilog (John Xu, Huzefa Sanjeliwala)
  Have straight pipeline 486 model with TLBs and caches
Statistics gathered in hardware
  Very little if any probe effect
Tools to semi-automate micro-architectural and ISA level exploration
  Orthogonality of models makes both simpler