Par Lab OS and Architecture Research


RAMP Gold Update
Zhangxi Tan, Krste Asanovic, David Patterson
UC Berkeley
August, 2008
A functional model for RAMP Gold
Parlab Manycore

RAMP Gold: A RAMP emulation model for Parlab manycore

Single-socket tiled manycore target

Split functional/timing model, both in hardware

[Figure: a timing model pipeline with its timing state, alongside a functional model pipeline with its architectural state]
SPARC v8 -> v9
Functional model: Executes ISA
Timing model: Captures pipeline timing detail (can be cycle accurate)
Host multithreading of both functional and timing models
Built on BEE3 system

Four Xilinx Virtex 5 LX110T
Functional model implementation in this talk
A RAMP Emulator

“RAMP Blue” as a proof of concept
  1,008 32-bit RISC cores on 105 FPGAs of 21 BEE2 boards

A bigger “RAMP Blue” with more FPGAs for Parlab?
  Less interesting ISA
  High-end FPGAs cost thousands of dollars
  CPU cores (@90 MHz) are even slower than memory!
    Waste memory bandwidth
  High CPI, low pipeline utilization
    Poor emulation performance/FPGA

Need a high-density and more efficient design
RAMP Gold Implementation

Goals:
  Performance: maximize aggregate emulated instruction throughput (GIPS/FPGA)
  Scalability: scale with reasonable resource consumption

Design for FPGA fabric
  SRAM nature of FPGA: RAMs are cheap!
    Efficient for state storage, but expensive for logic (e.g. multiplexers)
    Need to reconsider some traditional RISC optimizations
      A bypassing network goes against “smaller, faster” on FPGAs
      ~28% LUT reduction, ~18% frequency improvement on a SPARC v8 implementation without result forwarding
  DSPs are perfect for ALU implementation
  Circuit performance limited by routing
    Longer pipeline
    Carefully mapped FPGA primitives
    Emulation latencies: e.g. across FPGAs, memory network
    “High” frequency (targeting 150 MHz)

Host multithreading

Single hardware pipeline with multiple copies of CPU state


Fine-grained multithreading
The target CPUs themselves are not multithreaded
[Figure: target model CPUs 1 through 64 mapped onto one functional model pipeline on the FPGA: thread select (+1 counter), per-thread PCs, I$, IR, decode, per-thread GPR files, ALU, D$]
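As a rough software analogy for host multithreading, the sketch below keeps one PC per target CPU and lets a single pipeline pick the next thread round-robin on every host cycle. The names `HostPipeline`, `step`, and `run` are hypothetical, and the model assumes straight-line code with 4-byte instructions; it is not the RAMP Gold RTL.

```python
class HostPipeline:
    """One host pipeline time-multiplexed across many target CPUs."""

    def __init__(self, num_threads=64):
        self.num_threads = num_threads
        self.pc = [0] * num_threads   # one copy of the PC per target CPU
        self.next_thread = 0          # thread-select counter

    def step(self):
        """Issue one instruction for the selected thread, then rotate."""
        tid = self.next_thread
        fetched_pc = self.pc[tid]
        self.pc[tid] += 4             # assumption: sequential 4-byte instructions
        self.next_thread = (tid + 1) % self.num_threads  # the "+1" thread select
        return tid, fetched_pc


def run(cycles, num_threads=64):
    """Return the (thread id, pc) pair issued on each host cycle."""
    pipe = HostPipeline(num_threads)
    return [pipe.step() for _ in range(cycles)]
```

With 64 threads, each target CPU gets a pipeline slot once every 64 host cycles, so a target CPU advances by one instruction per full rotation.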
Pipeline Architecture
[Figure: pipeline block diagram.
 Stages: Thread Selection; Instruction Fetch 1; Instruction Fetch 2; Decode; Register File Access 1 & 2 (regfile read, 2 cycles, pipelined); Register File Access 3; Execution; Memory 1; Memory 2; Write Back / Exception.
 Stage functions: static thread selection (round robin); issue address request (tag/data read request); compare tag; memory request under cache miss; resolve branch and decode register file address; decode ALU control / exception detection; simple ALU / LDST decoding; special register handling (RDPSR/RDWIM); unaligned address detection / store preparation; issue load; read & select; load align / write back; trap/IRQ handling; generate microcode request.
 Blocks: special registers (pc/npc, wim, psr, thread control registers); microcode ROM; I-Cache (nine 18kb BRAMs); 32-bit multithreaded register file (four 36kb BRAMs); simple ALU (1 DSP); MUL/DIV/SHF (4 DSPs); D-Cache (nine 18kb BRAMs); 128-bit memory interface.
 Legend: LUT ROM, LUT RAM (clk x2), DSP (clk x2), BRAM (clk x2).]

Single-issue in-order pipeline (integer only)
11 pipeline stages (no forwarding) -> 7 logical stages
Static thread scheduling, zero-overhead context switch
Avoid complex operations with “microcode”
  E.g. traps, ST
Physical implementation
  All BRAM/LUTRAM/DSP blocks in double-clocked or DDR mode
  Extra pipeline stages for routing
  ECC/parity-protected BRAMs
    Deep submicron effect on FPGAs
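One plausible reading of how an 11-stage pipeline gets by without forwarding (an assumption about the rationale, not something the slide states): with static round-robin issue over at least as many threads as pipeline stages, a thread never has two instructions in flight at once, so no instruction waits on an older one from the same thread. The sketch below checks that property; `max_in_flight_per_thread` is a hypothetical helper.

```python
def max_in_flight_per_thread(num_threads, pipeline_depth, cycles=1000):
    """Issue one instruction per cycle round-robin and report the largest
    number of instructions any single thread has in the pipeline at once."""
    in_flight = []   # thread ids of instructions in the pipeline, oldest first
    worst = 0
    for cycle in range(cycles):
        in_flight.append(cycle % num_threads)   # static round-robin issue
        if len(in_flight) > pipeline_depth:
            in_flight.pop(0)                    # oldest instruction retires
        for tid in set(in_flight):
            worst = max(worst, in_flight.count(tid))
    return worst
```

For 64 threads and 11 stages the function returns 1; with only 4 threads it returns 3, which is the case where a real design would need interlocks or forwarding.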
Implementation Challenges

CPU state storage
  Where?
  How large? Does it fit on FPGA?
Minimize FPGA resource consumption
  E.g. Mapping ALU to DSPs
Host cache & TLB
  Need cache?
  Architecture and capacity
  Bandwidth requirement and R/W access ports
    host multithreading amplifies the requirement
State storage


Complete 32-bit SPARC v8 ISA with traps/exceptions
All CPU state (integer only) is stored in SRAMs on FPGA
  Per-context register file -- BRAM
    3 register windows stored in BRAM chunks of 64
    8 (global) + 3*16 (reg window) = 56
  6 special registers
    pc/npc -- LUTRAM
    PSR (Processor State Register) -- LUTRAM
    WIM (Window Invalid Mask) -- LUTRAM
    Y (High 32-bit result for MUL/DIV) -- LUTRAM
    TBR (Trap Base Register) -- BRAM (packed with regfile)
  Buffers for host multithreading (LUTRAM)
Maximum 64 threads per pipeline on Xilinx Virtex5
  Bounded by LUTRAM depth (6-input LUTs)
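The register-file arithmetic above can be checked with a short sketch. The address layout (thread `t` owns words `64*t` through `64*t + 63`) and the names `regfile_address` and `CHUNK_WORDS` are assumptions for illustration; the slide only states that each context uses a chunk of 64.

```python
GLOBAL_REGS = 8
WINDOWS = 3
REGS_PER_WINDOW = 16
CHUNK_WORDS = 64                 # BRAM words reserved per thread ("chunks of 64")
LUT_INPUTS = 6
MAX_THREADS = 2 ** LUT_INPUTS    # a 6-input LUT used as RAM is 64 entries deep

REGS_PER_THREAD = GLOBAL_REGS + WINDOWS * REGS_PER_WINDOW   # 8 + 48 = 56


def regfile_address(thread_id, reg_offset):
    """Word address of one register in the shared register-file BRAM.

    Assumed layout: thread t owns words [64*t, 64*t + 63]; reg_offset is the
    register's position inside that chunk (0..55 for the integer registers).
    """
    if not 0 <= thread_id < MAX_THREADS:
        raise ValueError("thread_id out of range")
    if not 0 <= reg_offset < CHUNK_WORDS:
        raise ValueError("reg_offset out of range")
    return thread_id * CHUNK_WORDS + reg_offset


TOTAL_WORDS = MAX_THREADS * CHUNK_WORDS   # 4096 32-bit words for 64 threads
```

Since 56 registers occupy a 64-word chunk, 8 words per thread remain, which leaves room for state such as the TBR that is packed with the register file.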
Mapping SPARC ALU to DSP

Xilinx DSP48E advantage
 48-bit add/sub/logic/mux + pattern detector
 Easy to generate ALU flags: < 10 LUTs for C, O
 Pipelined access over 500 MHz
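To illustrate why the carry and overflow flags are cheap once a wider adder is available, the sketch below performs a 32-bit add in a wider integer and derives both flags from the result. It is a software illustration only; `add_with_flags` is a hypothetical name, and the way the DSP48E slice is actually wired in RAMP Gold is not described in the slide.

```python
MASK32 = 0xFFFFFFFF


def add_with_flags(a, b):
    """32-bit add returning (result, carry, overflow)."""
    a &= MASK32
    b &= MASK32
    wide = a + b                      # wider than 32 bits, like a 48-bit adder
    result = wide & MASK32
    carry = (wide >> 32) & 1          # carry out of bit 31
    # Signed overflow: both operands share a sign that the result does not.
    overflow = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    return result, carry, overflow
```

For example, `add_with_flags(0x7FFFFFFF, 1)` overflows without a carry, while `add_with_flags(0xFFFFFFFF, 1)` produces a carry without overflow.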
DSP advantage


Instruction coverage (two DSPs / pipeline)
 1 cycle ALU (1 DSP)
 LD/ST (address calculation)
 Bit-wise logic (and, or, …)
 SETHI (value bypass)
 JMPL, RETT, CALL (address calculation)
 SAVE/RESTORE (add/sub)
 WRPSR, RDPSR, RDWIM (XOR op)
 Long latency ALU instructions (1 DSP)
 Shift/MUL (2 cycles)
5%~10% logic savings for a 32-bit data path
Host Cache/TLB

Accelerating emulation performance!
  Need separate model for target cache
Per-thread cache
  Split I/D direct-mapped write-allocate write-back cache
    Block size: 32 bytes (BEE3 DDR2 controller heart beat)
    64-thread configuration: 256B I$, 256B D$
    Size doubled in 32-thread configuration
  Non-blocking cache, 64 outstanding requests (max)
  Physical tags, indexed by virtual or physical address
    $ size < page size
  67% BRAM usage
Per-thread TLB
  Split I/D direct-mapped TLB: 8-entry ITLB, 8-entry DTLB
  Dummy currently (VA = PA)
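The cache geometry for the 64-thread configuration can be worked out as follows. The address split follows the usual direct-mapped scheme; the `ram_row` layout (each thread owning a contiguous group of rows in a shared RAM) and all function names are assumptions for illustration.

```python
BLOCK_BYTES = 32
CACHE_BYTES_PER_THREAD = 256
THREADS = 64

LINES_PER_THREAD = CACHE_BYTES_PER_THREAD // BLOCK_BYTES   # 8 lines
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1                  # 5 bits
INDEX_BITS = LINES_PER_THREAD.bit_length() - 1              # 3 bits
TOTAL_ROWS = THREADS * LINES_PER_THREAD                     # 512 rows


def split_address(addr):
    """Split a byte address into (tag, index, offset) for one thread's cache."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (LINES_PER_THREAD - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def ram_row(thread_id, addr):
    """Row in a shared cache RAM (assumed layout: 8 consecutive rows per thread)."""
    _, index, _ = split_address(addr)
    return thread_id * LINES_PER_THREAD + index
```

The index and offset together cover only the low 8 address bits, well below any page size, which is why the cache can be indexed by either the virtual or the physical address. The 512 total rows are consistent with the 512-entry tag RAM listed under Cache-Memory Architecture.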
Cache-Memory Architecture
[Figure: cache-memory block diagram. Memory controller side: memory command FIFO, memory request address, refill index, 128-bit refill data, victim data write back. Storage: data (ECC) 512x72x4 in four RAMB36SDP (x72); tag (parity) 512 x 36 in one RAMB18SDP. Integer pipeline side: lookup index, 64-bit data + tag, 64-bit read & modify data, cache FSM (hit, exception, etc.) with a cache replay signal to pipeline state control. Pipeline stages: Memory Stage (1) (prepare LD/ST address), Memory Stage (2) (load select / routing), Exception/Write Back Stage (load align/sign).]
Cache controller




Non-blocking pipelined access (3 stages) matches CPU pipeline
Decoupled access/refill: allows pipelined, out-of-order memory accesses
Tell the pipeline to “replay” the instruction on a miss
128-bit refill/write-back data path
 fill one block in 2 cycles
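A minimal sketch of the replay-on-miss idea and the refill arithmetic, under simplifying assumptions: the refill is modeled as finished before the replayed access arrives, and outstanding-request tracking is omitted. `TinyCache` and `refill_cycles` are hypothetical names, not part of the RAMP Gold design.

```python
class TinyCache:
    """Direct-mapped cache for one thread; a miss asks the pipeline to replay."""

    def __init__(self, lines=8, block_bytes=32):
        self.lines = lines
        self.block_bytes = block_bytes
        self.tags = [None] * lines

    def access(self, addr):
        block = addr // self.block_bytes
        index, tag = block % self.lines, block // self.lines
        if self.tags[index] == tag:
            return "hit"
        # Simplification: the refill completes before the instruction replays.
        self.tags[index] = tag
        return "replay"


def refill_cycles(block_bytes=32, datapath_bits=128):
    """Cycles needed to move one block over the refill data path."""
    return (block_bytes * 8) // datapath_bits
```

A 32-byte block is 256 bits, so a 128-bit data path moves it in 2 cycles, matching the figure quoted above. In a host-multithreaded pipeline the replayed instruction simply reissues on the thread's next turn, by which time the refill may already have landed.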
Example: A distributed memory non-cache coherent system

Eight multithreaded SPARC v8 pipelines in two clusters
  Each thread emulates one independent node in target system
  512 nodes/FPGA
  Predicted emulation performance:
    ~1 GIPS/FPGA (10% I$ miss, 30% D$ miss, 30% LD/ST)
    x2 compared to naïve manycore implementation
Memory subsystem
  Total memory capacity 16 GB, 32MB/node (512 nodes)
  One DDR2 memory controller per cluster
  Per FPGA bandwidth: 7.2 GB/s
  Memory space is partitioned to emulate distributed memory system
  144-bit wide credit-based memory network
Inter-node communication (under development)
  Two-level tree network to provide all-to-all communication
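The node count and per-node memory follow directly from the figures above. The sketch below also shows one way the host memory could be partitioned; the contiguous layout and the name `node_to_host_address` are assumptions, since the slide does not say how the partitioning is done.

```python
PIPELINES = 8
THREADS_PER_PIPELINE = 64
TOTAL_MEMORY_BYTES = 16 * 2**30                       # 16 GB

NODES = PIPELINES * THREADS_PER_PIPELINE              # 512 nodes per FPGA
BYTES_PER_NODE = TOTAL_MEMORY_BYTES // NODES          # 32 MB per node


def node_to_host_address(node_id, target_addr):
    """Map a node-local address to host memory (assumed contiguous partitions)."""
    if not 0 <= node_id < NODES:
        raise ValueError("node_id out of range")
    if not 0 <= target_addr < BYTES_PER_NODE:
        raise ValueError("target_addr outside the node's partition")
    return node_id * BYTES_PER_NODE + target_addr
```

Because each node can only generate addresses inside its own partition, nodes cannot observe each other's memory, which is what lets a shared DDR2 pool stand in for a distributed-memory target.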
Project Status

Done with RTL implementation
  ~7,200 lines of synthesizable SystemVerilog code
  FPGA resource utilization per pipeline on Xilinx V5 LX110T
    ~3% logic (LUT), ~10% BRAM
    Max 10 pipelines, but back off to 8 or fewer for the timing model

Built RTL verification infrastructure
  SPARC v8 certification test suite (donated by SPARC International) + SystemVerilog
  Can be used to run more programs, but very slow (~0.3 KIPS)
RTL Verification Flow in SW
[Figure: verification flow. SPARC V8 verification suite (.S or .C) and customized linker script (.lds) -> GNU SPARC v8 compiler/linker (sparc-linux-gcc, sparc-linux-as, sparc-linux-ld) -> ELF big-endian binaries -> ELF to BRAM translator -> simulation in ModelSim SE/QuestaSim 6.4 with the RAMP Gold SystemVerilog source files / netlist (.sv, .v) and the Xilinx Unisim library -> simulation log / console output -> checker. A host disassembler (C implementation: SPARC v8 disassembler host binary using the GNU libbfd library from GNU binutils) connects through the SystemVerilog DPI interface.]
Verification in progress




Tested instructions
  All SPARC v7 ALU instructions: add/sub, logic, shift
  All integer branch instructions
  All special instructions: register window, system registers
Working on: LD/ST and Trap
More verification after P&R and on HW
  work with the rest of the RAMP Gold infrastructure
Lessons so far
  Infrastructure is not trivial, and very few sample designs are available (have to build our own!)
  Multithreaded state complicates the verification process!
    buffers and shared FU interfaces
Thank you