FPGA Based SAT Solver - University of California, Berkeley


Multithreaded SPARC v8 Functional Model for RAMP Gold

Zhangxi Tan

UC Berkeley RAMP Retreat, Jan 17, 2008

Motivation

• Traditional RISC optimizations are far less appealing on soft-core processors on FPGAs
  – Mapped to expensive wide bus muxes; becomes area/frequency bottleneck on fabric
    • Bypassing network
    • Delayed branch
  – Less efficient when dealing with memory access latency
    • Small cache size & shared memory controller make things even worse
  – Poor core count on single FPGA! (e.g. V5 LX110T)
    • Fewer than 16 32-bit SPARC V8 LEON integer pipelines

Approach

• Need a new functional model, which is able to
  – Support a large number of emulated cores (~1k) per BEE3 board
  – Accelerate aggregate emulation performance (MIPS/chip)
    • Including optimizations to tolerate memory & I/O latency
  – Run full OS and support OS development
    • TLB/exception support
    • Memory mapped I/O + IRQ support
  – Interface with the timing model
• Virtualizing SPARC V8 RTL with fine-grain multithreading
  – High density design (256/512 emulated CPUs per chip)
    • 8 cores in 2 clusters per FPGA (V5 LX110T); each core has 32 or 64 threads (configurable)
    • 4 cores in one cluster share one BEE3 mem controller
  – Start from 32-bit ISA, eventually support 64-bit ISA (v9)
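The density figures above can be checked with a few lines of arithmetic. This is only a sketch: the cores-per-FPGA and threads-per-core values come from the slide, while `FPGAS_PER_BOARD = 4` is an assumption about the BEE3 board and the function name `emulated_cpus` is made up for illustration.

```python
# Back-of-envelope check of the emulated-CPU counts quoted on the slide.
CORES_PER_FPGA = 8             # from the slide: 8 cores in 2 clusters per FPGA
THREADS_PER_CORE = (32, 64)    # from the slide: configurable
FPGAS_PER_BOARD = 4            # ASSUMPTION: number of user FPGAs on one BEE3 board


def emulated_cpus(threads_per_core, fpgas=1):
    """Emulated CPUs = cores per FPGA x hardware threads per core x FPGAs."""
    return CORES_PER_FPGA * threads_per_core * fpgas


for threads in THREADS_PER_CORE:
    per_chip = emulated_cpus(threads)
    per_board = emulated_cpus(threads, FPGAS_PER_BOARD)
    print(f"{threads} threads/core: {per_chip} per chip, {per_board} per board")
```

Under that assumption the 32-thread configuration already reaches the ~1k emulated cores per board that the slide asks for.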

Design philosophy 1

• Keep everything simple!
  – Build processor w/o bypassing network
    • Greatly simplify pipeline design
    • Preliminary result shows ~28% LUT reduction + ~18% frequency improvement on Leon3 processor
  – Direct map cache/TLB
  – Simple fine-grain multithreading to fill pipeline bubbles
    • Static RR issue: T1->T2->T3->T4->T1->T2…
    • Never stall the pipeline
    • Long latency operations? Tell the pipeline to REPLAY the instruction in the next rotation
  – “Microcode” for complex instructions/trap handling
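The static round-robin issue with replay can be illustrated with a toy model. This is a hedged sketch, not the RTL: the function `run_round_robin`, the `"alu"`/`"long"` operation names and the fixed `latency` are all assumptions made for the example. One thread owns each cycle; a long-latency operation that is not yet complete is replayed on the thread's next turn, and no other thread ever waits.

```python
def run_round_robin(programs, latency, max_cycles):
    """Toy model: one issue slot per cycle, threads selected in fixed rotation.

    programs[t] is a list of ops for thread t; an op is "alu" (completes at
    once) or "long" (completes `latency` cycles after its first issue).
    Returns the final per-thread program counters and an event trace.
    """
    n = len(programs)
    pc = [0] * n
    ready_at = [None] * n          # completion cycle of a pending long op
    trace = []
    for cycle in range(max_cycles):
        t = cycle % n              # static round-robin thread selection
        if pc[t] >= len(programs[t]):
            trace.append((cycle, t, "idle"))
            continue
        op = programs[t][pc[t]]
        if op == "long":
            if ready_at[t] is None:
                ready_at[t] = cycle + latency   # first issue starts the operation
            if cycle < ready_at[t]:
                # Not done yet: replay on this thread's next turn, no stall.
                trace.append((cycle, t, "replay"))
                continue
            ready_at[t] = None
        pc[t] += 1
        trace.append((cycle, t, "retire:" + op))
    return pc, trace


programs = [["alu", "long", "alu"]] + [["alu"] * 3 for _ in range(3)]
pc, trace = run_round_robin(programs, latency=6, max_cycles=20)
for event in trace:
    print(event)
```

In this run thread 0 replays its long operation twice, while threads 1 to 3 retire an instruction on every one of their turns, which is the property the "never stall" rule is after.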

Design philosophy 2

• Design for fabric (Targeting Virtex 5)
  – High working frequency (expect ~150 MHz)
    • Deep pipeline: 10~11 physical stages
  – Manually controlled FPGA resources mapping
    • BRAMs, LUTRAM
    • Use V5 DSPs as ALU
    • Pipelining all BRAMs and DSPs (maximize Fmax)
  – Error detection/correction for all BRAMs
    • Cache tags and register file use parity bit to detect soft errors
    • TLB entry and cache data are protected by built-in V5 ECC BRAM

Challenges

• Thread state storage & per-thread L1 cache
  – Will BRAM/LUTRAM fit?
  – How large?
  – Where to map? LUTRAM or BRAM
• Bandwidth and RW ports requirement
  – Multithreading amplifies the requirement!
• How to make use of FPGA primitives to control total LUT usage
  – 6-input LUTs: LUT5_2, RAM64B
  – DSPs

State storage

• Main thread state (integer pipeline)
  – 3 register windows per thread (2-minimum by specification, 3 for performance)
    • 8 global + 16*3 window registers
    • Stored in BRAM in chunks of 64 registers
  – PC/nPC: LUTRAM
  – PSR (processor state register): LUTRAM
  – WIM (window invalid mask): LUTRAM
  – TBR (trap base register): BRAM, packed w. 3 reg window
  – Y (high 32-bit for mul/div): LUTRAM

Regfile layout

Thread   BRAM Address   BRAM Content
0        0-7            Global registers g0-g7
0        8              TBR
0        9-15           Scratch registers for microcode mode
0        16-63          3-register window
1        64-71          Global registers g0-g7
1        72             TBR
1        73-79          Scratch registers for microcode mode
1        80-127         3-register window
2        …              …

• 64 threads per pipeline, 8 pipelines per chip (V5 LX110T)
  • Eight 18kb blocks
• Double clocked BRAM (virtually 4 ports)
• Indexed with {thread_id, reg_addr}
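The `{thread_id, reg_addr}` indexing amounts to concatenating a thread number with a 6-bit register address. The sketch below reproduces the address ranges of the layout table and checks the block count; the helper names `bram_address` and `describe` are hypothetical, and the figure of 16 Kbit of data per 18kb block (the rest being parity) is an assumption about the Virtex-5 BRAM.

```python
REG_ADDR_BITS = 6                        # 64 register slots per thread (from the slide)
REGS_PER_THREAD = 1 << REG_ADDR_BITS


def bram_address(thread_id, reg_addr):
    """Concatenate {thread_id, reg_addr} into one register-file BRAM address."""
    if not 0 <= reg_addr < REGS_PER_THREAD:
        raise ValueError("reg_addr must be in 0..63")
    return (thread_id << REG_ADDR_BITS) | reg_addr


def describe(addr):
    """Map a BRAM address back to (thread, content) following the layout table."""
    thread_id, reg_addr = divmod(addr, REGS_PER_THREAD)
    if reg_addr < 8:
        kind = "global g%d" % reg_addr
    elif reg_addr == 8:
        kind = "TBR"
    elif reg_addr < 16:
        kind = "microcode scratch"
    else:
        kind = "window register"
    return thread_id, kind


# Capacity check for one 64-thread pipeline.
THREADS = 64
data_bits = THREADS * REGS_PER_THREAD * 32   # 32-bit registers
BRAM18_DATA_BITS = 16 * 1024                 # ASSUMPTION: data bits per 18kb block
blocks_needed = data_bits // BRAM18_DATA_BITS

print(bram_address(1, 8), describe(72), blocks_needed)
```

With these assumptions the register file of one pipeline needs eight 18kb blocks, matching the count on the slide.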

Cache & TLB

• Per thread Cache
  – Split I/D direct-map write-allocate write-back cache
    • Block size: 32 bytes (BEE3 DDR2 controller heart beat)
    • 512B total in 64-thread configuration: 256B I$, 256B D$
    • Size doubled (1KB) for 32-thread configuration
    • Non-blocking to a different thread, but blocking to the same thread
    • CPU and memory controller access cache at the same time through different ports
  – Physical tag
• Per thread TLB
  – Split I/D direct-map TLB
    • 16 entries in total: 8 for ITLB and 8 for DTLB
• Total BRAM usage per thread (regfile + cache/TLB + tag + misc): 30~32 blocks (18kb)
• BRAM is still the critical resource
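For a direct-mapped cache with 32-byte blocks, a 256B cache has 8 lines, so a physical address splits into a 5-bit block offset, a 3-bit line index and a tag. The following is a minimal sketch of that split; the function name `split_address` and the example addresses are illustrative assumptions, and the physical address width is deliberately left open.

```python
def split_address(paddr, cache_bytes=256, block_bytes=32):
    """Split a physical address into (tag, index, offset) for a direct-mapped cache.

    Defaults follow the 64-thread configuration on the slide (256B per I$ or D$,
    32-byte blocks). Both sizes must be powers of two.
    """
    lines = cache_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = lines.bit_length() - 1          # log2(number of lines)
    offset = paddr & (block_bytes - 1)
    index = (paddr >> offset_bits) & (lines - 1)
    tag = paddr >> (offset_bits + index_bits)
    return tag, index, offset


# Two addresses that land on the same line with different tags conflict.
print(split_address(0x1234))                    # 64-thread configuration
print(split_address(0x1334))
print(split_address(0x1234, cache_bytes=512))   # 32-thread configuration
```

Doubling the cache for the 32-thread configuration adds one index bit and removes one tag bit; nothing else in the lookup changes.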

DSP48E are perfect for ALU

• DSP48E is a MAC.
• Two 48-bit inputs, one 48-bit output
  – Add/subtract/logic/bypass/address calculation
  – Pattern detector (generate Z flag)
    • <10 LUTs for C, O, nothing for N
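One way to picture the flag generation is a behavioural model of a 32-bit ADDcc running on a 48-bit adder. This is a sketch under assumptions, not the author's circuit: operands are assumed to be zero-extended to 48 bits, Z is modelled as a compare of the low 32 bits against zero (the job given to the pattern detector), N is simply bit 31 of the result, and C and O are derived from a handful of bits, which is consistent with the small LUT cost quoted above. The name `addcc` is hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK48 = (1 << 48) - 1


def addcc(a, b):
    """Behavioural model of a 32-bit add with condition codes on a 48-bit adder."""
    a &= MASK32
    b &= MASK32
    wide = (a + b) & MASK48            # ASSUMPTION: operands zero-extended to 48 bits
    result = wide & MASK32
    n = (result >> 31) & 1             # N: sign bit, no extra logic
    z = int(result == 0)               # Z: pattern detect on the low 32 bits
    c = (wide >> 32) & 1               # C: carry out of bit 31
    a31, b31 = (a >> 31) & 1, (b >> 31) & 1
    o = int(a31 == b31 and n != a31)   # O: same-sign operands, different-sign result
    return result, {"N": n, "Z": z, "O": o, "C": c}


print(addcc(0xFFFFFFFF, 1))
print(addcc(0x7FFFFFFF, 1))
```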

Mapping SPARC instructions to DSP48E

• Most of SPARC v8 instructions can be covered by DSP48E
  – 1 cycle ALU (1 DSP)
    • LD/ST (address calculation)
    • Bit-wise logic (and, or, …)
    • SETHI
    • JMPL, RETT, Call
    • Write special register (WRPSR)
    • SAVE/RESTORE
  – Long latency ALU
    • Pipelined shift/Mul (4 DSPs)
    • Divide (1 DSP)
  – Misc
    • RDPSR, RDWIM (XOR ops.)
  – Only one 32-bit adder is not in DSP (nPC+4)
• DSP48E is not a silver bullet
  – Barrel shifter/shifter support is weak
    • Altera does better on shifters
  – 48-bit is odd!
    • Expecting 64-bit input DSPs w. 32x32 multipliers (DSP64E?)

Pipeline Arch

7-stage pipeline – MMU support soon

[Figure: pipeline diagram. Stages: Thread Selection (static, round robin), Instruction Fetch 1 (issue address request), Instruction Fetch 2 (compare tag), Decode (resolve branch, decode register file address), Regfile Access (1 or 2 cycles), Execution, Memory, Write Back (load align). Blocks shown: special registers (pc/npc, wim, psr, thread control registers); Microcode ROM (micro inst., synthesized instruction vs. fetched 32-bit instruction); I-Cache (nine 18kb BRAMs) and D-Cache (nine 18kb BRAMs), each with tag/data read requests and a 256-bit memory interface; mem request under cache miss; 32-bit register file (four 36kb BRAMs) feeding OP1, OP2, imm and pc; decode ALU control / exception detection; simple ALU (1 DSP) / LDST decoding; MUL/DIV/SHF (4 DSPs); special register handling (RDPSR/RDWIM); unaligned address detection / store preparation; load (issue address), tag / 128-bit data, 128-bit read & modify data; trap/IRQ handling; generate microcode request. Legend: LUT ROM, LUT RAM (clk x2), DSP (clk x2), BRAM (clk x2).]

[Figure: chip floorplan, Virtex 5 LX110T. Cores 1-8 in two clusters (Cluster 1, Cluster 2); each core is a SPARC V8 pipeline (64 threads) with 256B I$ and 256B D$; BEE3 DDR2 memory controllers 1 and 2, with buses labelled 144 bits.]

Status

• Coded in SystemVerilog
  – ~4000 lines of code implemented
• Push to synthesis tools in Feb 08
  – Synthesize with Precision or Synplify
  – Full V8 instruction (integer) support (no MMU)
  – Aiming ~150 MHz, estimate <4000 LUTs per core
• Verification Goal
  – Pass microsparc verification suite / sparc.org certification test

Backup Slides

SPARC vs MIPS

• Similar ISA
  – Similar ALU/Jump and Link/Jump instructions
  – Similar LD/ST inst. (LDB, LDH, LDW)
  – Delayed branch
• Except
  – Branch on 4 condition codes (N, C, O, Z)
    • E.g. Addcc r1, r2, r3 followed by Bicc address
  – Trap on condition code for SW traps (e.g. system call)
  – Register window (2-32 windows)
    • Only 1 window (32 registers) is active at a time, selected by the CWP field in the Processor State Register (PSR)
    • SAVE/RESTORE, RETT, trap will affect the window
    • SAVE/RESTORE are commonly used in function calls
  – No FPU <-> Integer register file transfer instructions
  – Difference in atomic instructions:
    • MIPS: LL/SC, SPARC: LDSTUB, SWAP
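The register-window behaviour can be sketched as an index calculation. In SPARC V8 the outs of the caller become the ins of the callee after SAVE, which falls out naturally if each window contributes 16 slots (8 outs + 8 locals) and SAVE decrements CWP. The code below is a hedged illustration: `NWINDOWS = 3` follows the three windows per thread used in this design, the helper names are made up, the slot numbering is an assumption about how the 48 windowed registers might be laid out, and the window overflow/underflow traps checked against WIM are omitted.

```python
NWINDOWS = 3   # three windows per thread, as in the state-storage slide


def window_slot(cwp, r, nwindows=NWINDOWS):
    """Slot of windowed register r (8..31) in a circular file of nwindows*16 slots.

    r = 8..15 are outs, 16..23 locals, 24..31 ins. Globals (r = 0..7) are not
    windowed and are handled elsewhere.
    """
    if not 8 <= r <= 31:
        raise ValueError("only r8..r31 are windowed")
    return (cwp * 16 + (r - 8)) % (nwindows * 16)


def save(cwp, nwindows=NWINDOWS):
    """SAVE moves to the next window by decrementing CWP (traps omitted)."""
    return (cwp - 1) % nwindows


def restore(cwp, nwindows=NWINDOWS):
    """RESTORE returns to the previous window by incrementing CWP (traps omitted)."""
    return (cwp + 1) % nwindows


caller = 1
callee = save(caller)
# The caller's %o0 (r8) and the callee's %i0 (r24) name the same slot.
print(window_slot(caller, 8), window_slot(callee, 24))
```

Because ins and outs alias in this way, arguments are passed across a call without copying, which is why SAVE/RESTORE show up in almost every function prologue and epilogue.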