Transcript FPGA Based SAT Solver - University of California, Berkeley
Multithreaded SPARC v8 Functional Model for RAMP Gold
Zhangxi Tan
UC Berkeley RAMP Retreat, Jan 17, 2008
Motivation
• Traditional RISC optimizations are far less appealing on soft-core processors on FPGAs
  – Mapped to expensive wide bus muxes; becomes area/frequency bottleneck on fabric
    • Bypassing network
    • Delayed branch
  – Less efficient when dealing with memory access latency
    • Small cache size & shared memory controller make things even worse
  – Poor core count on single FPGA! (e.g. V5 LX110T)
    • <16 32-bit Sparc V8 Leon integer pipelines
Approach
• Need a new functional model, which is able to
  – Support a large number of emulated cores (~1k) per BEE3 board
  – Accelerate aggregate emulation performance (MIPS/chip)
    • Including optimizations to tolerate memory & I/O latency
  – Run full OS and support OS development
    • TLB/exception support
    • Memory mapped I/O + IRQ support
  – Interface with timing model
• Virtualizing Sparc V8 RTL with fine-grain multithreading
  – High density design (256/512 emulated CPUs per chip)
    • 8 cores in 2 clusters per FPGA (V5 LX110T); each core has 32 or 64 threads (configurable)
    • 4 cores in one cluster share one BEE3 mem controller
  – Start from 32-bit ISA, eventually support 64-bit ISA (v9)
Design philosophy 1
• Keep everything simple!
  – Build processor w/o bypassing network
    • Greatly simplify pipeline design
    • Preliminary result shows ~28% LUT reduction + ~18% frequency improvement on Leon3 processor
  – Direct map cache/TLB
  – Simple fine-grain multithreading to fill pipeline bubbles
    • Static RR issue: T1->T2->T3->T4->T1->T2...
    • Never stall the pipeline
  – Long latency operations?
    • Tell the pipeline to REPLAY the instruction in the next rotation
  – "Microcode" for complex instructions/trap handling
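The static round-robin issue with replay described above can be illustrated with a small behavioral model. This is only a sketch under assumptions: the function name `run_round_robin`, the per-instruction "extra latency" encoding, and the event names are hypothetical and are not taken from the RAMP Gold RTL.

```python
def run_round_robin(programs, max_cycles=1000):
    """Behavioral sketch of static round-robin issue with replay.

    programs[t] is a list of extra latencies (in cycles) for thread t's
    instructions; 0 means the instruction completes on its first issue.
    Returns a trace of (cycle, thread, event) tuples.
    """
    n = len(programs)
    pc = [0] * n             # index of the next instruction per thread
    ready_at = [None] * n    # cycle at which a pending long operation is done
    trace = []
    for cycle in range(max_cycles):
        if all(pc[t] >= len(programs[t]) for t in range(n)):
            break
        t = cycle % n        # static slot: the pipeline never stalls or reorders
        if pc[t] >= len(programs[t]):
            trace.append((cycle, t, "idle"))
            continue
        if ready_at[t] is None:
            # First issue of this instruction: start the long operation.
            ready_at[t] = cycle + programs[t][pc[t]]
        if cycle >= ready_at[t]:
            trace.append((cycle, t, "retire"))
            pc[t] += 1
            ready_at[t] = None
        else:
            # Result not ready: replay the same instruction next rotation.
            trace.append((cycle, t, "replay"))
    return trace


# Four threads; thread 1 starts with an operation needing 5 extra cycles.
trace = run_round_robin([[0, 0], [5, 0], [0], [0]])
thread1_events = [event for _, thread, event in trace if thread == 1]
```

In this model the other threads keep retiring instructions in their own slots while thread 1 replays, which is the latency-hiding effect the slide relies on.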
Design philosophy 2
• Design for fabric (Targeting Virtex 5)
  – High working frequency (expect ~150 MHz)
    • Deep pipeline: 10~11 physical stages
  – Manually controlled FPGA resources mapping
    • BRAMs, LUTRAM
    • Use V5 DSPs as ALU
    • Pipelining all BRAMs and DSPs (maximize Fmax)
  – Error detection/correction for all BRAMs
    • Cache tags and register file use parity bit to detect soft errors
    • TLB entry and cache data are protected by built-in V5 ECC BRAM
Challenges
• Thread state storage & per-thread L1 cache
  – Will BRAM/LUTRAM fit?
  – How large?
  – Where to map? LUTRAM or BRAM
• Bandwidth and RW ports requirement
  – Multithreading amplifies the requirement!
• How to make use of FPGA primitives to control total LUT usage
  – 6-input LUTs: LUT5_2, RAM64B
  – DSPs
State storage
• Main thread state (integer pipeline)
  – 3 register windows per thread (2 minimum by specification, 3 for performance)
    • 8 global + 16*3 window registers
    • Stored in BRAM in chunks of 64 registers
  – PC/nPC – LUTRAM
  – PSR (processor state register) – LUTRAM
  – WIM (window invalid mask) – LUTRAM
  – TBR (trap base register) – BRAM, packed with the 3 register windows
  – Y (high 32 bits for mul/div) – LUTRAM
Regfile layout
Thread   BRAM Address   BRAM Content
0        0-7            Global register g0-g7
         8              TBR
         9-15           scratch register for microcode mode
         16-63          3-register window
1        64-71          Global register g0-g7
         72             TBR
         73-79          scratch register for microcode mode
         80-127         3-register window
2        …              ….
• 64 threads per pipeline, 8 pipelines per chip (V5 LX110T)
  • Eight 18kb blocks
• Double clocked BRAM (virtually 4 ports)
• Indexed with {thread_id, reg_addr}
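The {thread_id, reg_addr} indexing can be sketched as follows. The 64-register chunk and the 0-7 / 8 / 9-15 / 16-63 split come from the layout above; the ordering of registers inside the 48-entry window region (outs, locals, ins of each window in 16-register steps) is an assumption based on the usual SPARC overlapping-window scheme, and the function names are hypothetical.

```python
NWINDOWS = 3                    # register windows per thread (from the slides)
WINDOW_BASE = 16                # windowed registers occupy offsets 16-63
WINDOW_REGS = 16 * NWINDOWS     # 48 windowed registers per thread


def reg_offset(cwp, r):
    """Offset of architectural register r (0-31) inside a 64-entry chunk."""
    if not 0 <= r < 32:
        raise ValueError("register number out of range")
    if not 0 <= cwp < NWINDOWS:
        raise ValueError("window pointer out of range")
    if r < 8:
        return r                # globals g0-g7 live at offsets 0-7
    # Assumed mapping: each window contributes 16 new registers, so the
    # ins (r24-r31) of window w alias the outs (r8-r15) of window w+1.
    return WINDOW_BASE + (cwp * 16 + (r - 8)) % WINDOW_REGS


def bram_address(thread_id, cwp, r):
    """Concatenate {thread_id, reg_addr}: 6-bit offset below the thread id."""
    return (thread_id << 6) | reg_offset(cwp, r)


# Capacity check: 64 threads x 64 registers x 32 bits, using the 16 Kbit
# data portion of each 18kb block (the remaining 2 Kbit are parity bits).
blocks_needed = (64 * 64 * 32) // (16 * 1024)
```

Under these assumptions thread 1's first windowed register lands at address 80, matching the 80-127 range in the layout, and the capacity check agrees with the eight 18kb blocks quoted on the slide.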
Cache & TLB
• Per thread Cache
  – Split I/D direct-map write-allocate write-back cache
    • Block size: 32 bytes (BEE3 DDR2 controller heart beat)
    • 512B total in 64-thread configuration: 256B I$, 256B D$
    • Size doubled (1KB) for 32-thread configuration
  – Non-blocking to a different thread, but blocking to the same thread
  – CPU and memory controller access cache at the same time through different ports
  – Physical tag
• Per thread TLB
  – Split I/D direct-map TLB
    • 16 entries in total: 8 for ITLB and 8 for DTLB
• Total BRAM usage per thread (regfile + cache/TLB + tag + misc): 30~32 blocks (18kb)
• BRAM is still the critical resource
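For a direct-mapped cache with 32-byte blocks, a physical address splits into tag, line index, and byte offset. The sketch below shows that split for the 256B (64-thread) and 512B (32-thread) per-thread caches; the function name `split_address` and its default argument are illustrative, not part of the design.

```python
BLOCK_BYTES = 32  # block size from the slides


def split_address(paddr, cache_bytes=256):
    """Return (tag, index, offset) for a direct-mapped cache.

    cache_bytes is the size of one I$ or D$ (256 with 64 threads,
    512 with 32 threads); it must be a power of two.
    """
    lines = cache_bytes // BLOCK_BYTES
    offset_bits = BLOCK_BYTES.bit_length() - 1   # 5 bits for 32-byte blocks
    index_bits = lines.bit_length() - 1          # 3 bits for 8 lines
    offset = paddr & (BLOCK_BYTES - 1)
    index = (paddr >> offset_bits) & (lines - 1)
    tag = paddr >> (offset_bits + index_bits)
    return tag, index, offset


small = split_address(0x1234)         # 256B cache: 8 lines
large = split_address(0x1234, 512)    # 512B cache: 16 lines
```

With only 8 or 16 lines per thread, the index is 3 or 4 bits wide and nearly all of the physical address ends up in the tag.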
DSP48E are perfect for ALU
• DSP48E is a MAC.
• Two 48-bit inputs, one 48-bit output
  – Add/subtract/logic/bypass/address calculation
  – Pattern detector (generate Z flag)
    • <10 LUTs for C, O, nothing for N
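A rough model of how the four condition codes fall out of a wider-than-32-bit adder is sketched below. It assumes 32-bit operands are zero-extended into the 48-bit datapath, so the carry (or borrow) appears at bit 32, N is simply bit 31 of the result, and Z is an all-zero match on the low 32 bits; the function names and the dictionary layout are hypothetical.

```python
MASK48 = (1 << 48) - 1
MASK32 = 0xFFFFFFFF


def _flags(a, b, wide, subtract):
    result = wide & MASK32
    if subtract:
        # Overflow: operands differ in sign and the result sign differs from a.
        overflow = (((a ^ b) & (a ^ result)) >> 31) & 1
    else:
        # Overflow: operands share a sign and the result sign differs.
        overflow = (((a ^ result) & (b ^ result)) >> 31) & 1
    return result, {
        "N": (result >> 31) & 1,   # free: just a result bit
        "Z": int(result == 0),     # pattern-detector style all-zero match
        "O": overflow,             # the slides write O for overflow
        "C": (wide >> 32) & 1,     # carry out (add) or borrow (subtract)
    }


def addcc(a, b):
    """32-bit add with condition codes, computed on a 48-bit datapath."""
    return _flags(a, b, (a + b) & MASK48, subtract=False)


def subcc(a, b):
    """32-bit subtract with condition codes, computed on a 48-bit datapath."""
    return _flags(a, b, (a - b) & MASK48, subtract=True)
```

Only the overflow expression and the bit-32 tap need logic outside the wide adder in this model, which is in line with the small LUT count quoted for C and O.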
Mapping SPARC instructions to DSP48E
• Most of SPARC v8 instructions can be covered by DSP48E
  – 1 cycle ALU (1 DSP)
    • LD/ST (address calculation)
    • Bit-wise logic (and, or, …)
    • SETHI
    • JMPL, RETT, Call
    • Write special register (WRPSR)
    • SAVE/RESTORE
  – Long latency ALU
    • Pipelined shift/Mul (4 DSPs)
    • Divide (1 DSP)
  – Misc
    • RDPSR, RDWIM (XOR ops.)
• Only one 32-bit adder is not in DSP (nPC+4)
• DSP48E is not a silver bullet
  – Barrel shifter/shifter support is weak
    • Altera does better on shifters
  – 48-bit is odd!
    • Expecting 64-bit input DSPs w. 32x32 multipliers (DSP64E?)
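One plausible way for shifts to share the multiplier DSPs is to treat a left shift as a multiplication by a power of two. The slides only state "Pipelined shift/Mul (4 DSPs)" and do not describe the decomposition, so everything below is an assumption: the four 16x16 partial products are chosen for readability and do not reflect the DSP48E's native operand widths, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def mul32x32(a, b):
    """Unsigned 32x32 -> 64-bit product from four partial products."""
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16
    p0 = a_lo * b_lo      # each partial product stands in for one multiplier
    p1 = a_lo * b_hi
    p2 = a_hi * b_lo
    p3 = a_hi * b_hi
    return p0 + ((p1 + p2) << 16) + (p3 << 32)


def umul(a, b):
    """Return (y, rd): high 32 bits go to Y, low 32 bits to the destination."""
    product = mul32x32(a, b)
    return product >> 32, product & MASK32


def sll(value, count):
    """Shift left logical via the multiplier; only the low 5 count bits are used."""
    return mul32x32(value, 1 << (count & 31)) & MASK32
```

A right shift could be modeled the same way by taking the high half of a suitable product, but that is left out here because the slides give no detail on it.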
Pipeline Arch
7-stage pipeline – MMU support soon
[Figure: pipeline block diagram]
• Stages: Thread Selection, Instruction Fetch, Decode, Register File Access, Execution, Memory, Write Back
  – Thread Selection: static thread selection (round robin); special registers (pc/npc, wim, psr, thread control registers)
  – Instruction Fetch 1 (issue address request); Instruction Fetch 2 (compare tag)
  – Microcode ROM: supplies micro inst. / synthesized instruction alongside the 32-bit instruction
  – Decode (resolve branch, decode register file address); mem request under cache miss
  – Regfile Access (1 or 2 cycles): operands OP1, OP2, imm, pc
  – Decode ALU control / exception detection; generate microcode request
  – Simple ALU (1 DSP) / LDST decoding; MUL/DIV/SHF (4 DSPs); special register handling (RDPSR/RDWIM)
  – Unaligned address detection / store preparation; load (issue address); tag/data read request
  – Trap/IRQ handling; read & modify (128-bit read & modify data); load align / write back
• Storage: I-Cache (nine 18kb BRAMs), D-Cache (nine 18kb BRAMs, tag / 128-bit data), 32-bit Register File (four 36kb BRAMs); 256-bit memory interface on both caches
• Legend: LUT ROM, LUT RAM (clk x2), DSP (clk x2), BRAM (clk x2)
[Figure: chip-level layout on Virtex 5 LX110T]
• Two clusters (Cluster 1, Cluster 2) holding Core 1 to Core 8
• Each core: SPARC V8 Pipeline (64 Threads) with 256B I$ and 256B D$
• BEE3 DDR2 Memory controller 1 and BEE3 DDR2 Memory controller 2, 144-bit connections
Status
• Coded in Systemverilog
  – ~4000 lines of code implemented
• Push to synthesis tools in Feb 08
  – Synthesize with Precision or Synplify
  – Full V8 instruction (integer) support (no MMU)
  – Aiming ~150 MHz, estimate <4000 LUTs per core
• Verification Goal
  – Pass microsparc verification suite / sparc.org certification test
Backup Slides
SPARC vs MIPS
• Similar ISA
  – Similar ALU/Jump and Link/Jump instructions
  – Similar LD/ST inst. (LDB, LDH, LDW)
  – Delayed branch
• Except
  – Branch on 4 condition codes (N, C, O, Z)
    • E.g. Addcc r1, r2, r3 followed by Bicc address
  – Trap on condition code for SW traps (e.g. system call)
  – Register window (2-32 windows)
    • Only 1 window (32 registers) is active, controlled by CWP field in Processor State Register (PSR)
    • SAVE/RESTORE, RETT, trap will affect the window
    • SAVE/RESTORE are commonly used in function calls
  – No FPU <-> Integer register file transfer instructions
  – Difference in atomic instructions:
    • MIPS: LL/SC, SPARC: LDSTUB, SWAP
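The register-window behavior that sets SPARC apart from MIPS can be sketched as CWP arithmetic: SAVE moves to window CWP-1 and RESTORE to window CWP+1, both modulo the number of windows, and either one traps if the target window is marked in WIM. The exception classes and function names below are hypothetical, and the default of 3 windows simply mirrors the configuration used in this design.

```python
class WindowOverflow(Exception):
    """Raised when SAVE would enter a window marked invalid in WIM."""


class WindowUnderflow(Exception):
    """Raised when RESTORE would enter a window marked invalid in WIM."""


def save(cwp, wim, nwindows=3):
    """Return the new CWP after SAVE, or raise on window overflow."""
    new_cwp = (cwp - 1) % nwindows
    if (wim >> new_cwp) & 1:
        raise WindowOverflow(new_cwp)
    return new_cwp


def restore(cwp, wim, nwindows=3):
    """Return the new CWP after RESTORE, or raise on window underflow."""
    new_cwp = (cwp + 1) % nwindows
    if (wim >> new_cwp) & 1:
        raise WindowUnderflow(new_cwp)
    return new_cwp


# With window 1 marked invalid, a call from window 0 wraps to window 2,
# and a second nested call would overflow into window 1.
cwp_after_call = save(0, wim=0b010)
try:
    save(cwp_after_call, wim=0b010)
    nested_call_trapped = False
except WindowOverflow:
    nested_call_trapped = True
```

With only 3 windows per thread, deeper call chains hit the overflow trap quickly, which is where the trap handling mentioned earlier comes in.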