Par Lab OS and Architecture Research
RAMP Gold Update
Zhangxi Tan, Krste Asanovic, David Patterson
UC Berkeley
August, 2008
A functional model for RAMP Gold
Parlab Manycore
RAMP Gold: A RAMP emulation
model for Parlab manycore
Single-socket tiled manycore
target
Split functional/timing model, both
in hardware
[Figure: split model. A timing model pipeline with its timing state, alongside a functional model pipeline with its architectural state]
SPARC v8 -> v9
Functional model: Executes ISA
Timing model: Captures pipeline timing
detail (can be cycle accurate)
Host multithreading of both
functional and timing models
Built on BEE3 system
Four Xilinx Virtex 5 LX110T
Functional model implementation in this talk
A RAMP Emulator
“RAMP Blue” as a proof of concept
A bigger “RAMP Blue” with more FPGAs for
Parlab?
1,008 32-bit RISC cores on 105 FPGAs of 21
BEE2 boards
Less interesting ISA
High-end FPGAs cost thousands of dollars
CPU cores (@90 MHz) are even slower than
memory!
Wastes memory bandwidth
High CPI, low pipeline utilization
Poor emulation performance/FPGA
Need a higher-density, more efficient
design
RAMP Gold Implementation
Goals:
Performance: maximize aggregate emulated instruction
throughput (GIPS/FPGA)
Scalability: scale with reasonable resource consumption
Design for FPGA fabric
SRAM nature of FPGA: RAMs are cheap!
Efficient for state storage, but expensive for logic (e.g.
multiplexer)
Need to reconsider some traditional RISC optimizations
A bypassing network works against “smaller, faster” on FPGAs
~28% LUT reduction, ~18% frequency improvement on SPARC v8
implementation without result forwarding
DSPs are perfect for ALU implementation
Circuit performance limited by routing
Longer pipeline
Carefully mapped FPGA primitives
Emulation latencies: e.g. across FPGAs, memory network
“High” frequency (targeting 150 MHz)
Host multithreading
Single hardware pipeline with multiple copies of CPU state
Fine-grained multithreading
Not multithreading of the target
[Figure: target model of 64 CPUs (CPU 1, CPU 2, ..., CPU 63, CPU 64) mapped onto one functional model pipeline on the FPGA: thread select (6-bit thread index), per-thread PCs, I$, IR, decode, per-thread GPR files, ALU, D$]
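A minimal Python sketch of the host multithreading idea above: one shared pipeline, one copy of CPU state per target CPU, and a round-robin thread select. The class and field names are hypothetical, and the "execute" step is a stand-in rather than real SPARC semantics.

```python
from dataclasses import dataclass, field

NUM_THREADS = 64  # one host thread per target CPU, as on the slide


@dataclass
class ThreadState:
    """Per-target-CPU state (hypothetical, heavily simplified)."""
    pc: int = 0
    gpr: list = field(default_factory=lambda: [0] * 32)


class HostPipeline:
    """One shared pipeline holding a copy of the state for every target CPU."""

    def __init__(self, num_threads=NUM_THREADS):
        self.threads = [ThreadState() for _ in range(num_threads)]
        self.next_thread = 0  # round-robin thread select counter

    def step(self):
        """Issue one instruction from the selected thread, then rotate."""
        tid = self.next_thread
        state = self.threads[tid]
        # Stand-in for fetch/decode/execute: bump a register and the PC.
        state.gpr[1] += 1
        state.pc += 4
        self.next_thread = (tid + 1) % len(self.threads)
        return tid


pipe = HostPipeline()
issued = [pipe.step() for _ in range(2 * NUM_THREADS)]
```

After two full rotations every target CPU has advanced by two instructions, although only one pipeline exists on the host.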
Pipeline Architecture
[Figure: pipeline block diagram of a single-issue, in-order pipeline (integer only), with the physical implementation of each stage]
Stages shown: Thread Selection, Instruction Fetch 1, Instruction Fetch 2, Decode, Register File Access 1 & 2 (2 cycles, pipelined), Register File Access 3, Execution, Memory 1, Memory 2, Write Back / Exception
Thread Selection: static thread selection (round robin); special registers (pc/npc, wim, psr, thread control registers); microcode ROM
Instruction Fetch 1/2: tag/data read request (issue address), tag compare, memory request under cache miss; I-Cache (nine 18 Kb BRAMs)
Decode: resolve branch, decode register file address
Register file: 32-bit multithreaded register file (four 36 Kb BRAMs)
Execution: decode ALU control, exception detection; simple ALU (1 DSP), LD/ST decoding; special register handling (RDPSR/RDWIM); MUL/DIV/SHF (4 DSPs)
Memory 1/2: unaligned address detection, store preparation, issue load; tag/data read request; read & select; 128-bit read & modify data; D-Cache (nine 18 Kb BRAMs); 128-bit memory interface
Write Back / Exception: load align, write back; trap/IRQ handling; generate microcode request
Legend: LUT ROM, LUT RAM (clk x2), DSP (clk x2), BRAM (clk x2)
11 pipeline stages (no forwarding)
-> 7 logical stages
Static thread scheduling, zero
overhead context switch
Avoid complex operations with
“microcode”
E.g. traps, ST
All BRAM/LUTRAM/DSP blocks in
double clocked or DDR mode
Extra pipeline stages for routing
ECC/Parity protected BRAMs
Deep submicron effect on
FPGAs
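A short Python sketch of why static round-robin scheduling can go without forwarding: if a thread re-enters the pipeline only after its previous instruction has left, no data hazard exists within that thread. The check below is a deliberately conservative simplification (it ignores which stage actually reads registers), and the function names are hypothetical.

```python
PIPELINE_STAGES = 11  # physical stages, from the slide


def issue_gap(num_threads):
    """Cycles between consecutive instructions of one thread under round robin."""
    return num_threads


def needs_forwarding(num_threads, stages=PIPELINE_STAGES):
    """True if a thread could issue again before its previous instruction retires."""
    return issue_gap(num_threads) < stages


def in_flight_per_thread(num_threads, stages=PIPELINE_STAGES):
    """Maximum number of instructions from one thread in the pipeline at once."""
    # Ceiling division: a thread re-enters every `num_threads` cycles.
    return -(-stages // num_threads)
```

With 64 threads, `in_flight_per_thread(64)` is 1, so each thread sees its own previous result already written back; with only 4 threads the same pipeline would hold up to 3 instructions per thread and would need interlocks or forwarding.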
Implementation Challenges
CPU state storage
Minimize FPGA resource consumption
Where?
How large? Does it fit on FPGA?
E.g. Mapping ALU to DSPs
Host cache & TLB
Need cache?
Architecture and capacity
Bandwidth requirement and R/W access ports
Host multithreading amplifies the requirement
State storage
Complete 32-bit SPARC v8 ISA with traps/exceptions
All CPU state (integer only) is stored in SRAMs on FPGA
Per context register file -- BRAM
3 register windows stored in BRAM chunks of 64
8 (global) + 3*16 (reg window) = 56
6 special registers
pc/npc -- LUTRAM
PSR (Processor State Register) -- LUTRAM
WIM (Window Invalid Mask) -- LUTRAM
Y (High 32-bit result for MUL/DIV) -- LUTRAM
TBR (Trap Base Register) -- BRAM (packed with regfile)
Buffers for host multithreading (LUTRAM)
Maximum 64 threads per pipeline on Xilinx Virtex5
Bounded by LUTRAM depth (6-input LUTs)
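A hedged Python sketch of how a per-thread, windowed register file could be addressed inside one BRAM. The layout is an assumption for illustration (thread ID selects a 64-entry chunk, globals first, then the windowed registers); the deck does not give the actual address mapping. It relies on the standard SPARC convention that a window's `ins` are the neighbouring window's `outs`, so each window contributes 16 unique registers.

```python
NWINDOWS = 3            # register windows per thread, from the slide
GLOBALS = 8
REGS_PER_WINDOW = 16    # 8 outs + 8 locals; ins alias the neighbour's outs
CHUNK = 64              # BRAM entries reserved per thread
MAX_THREADS = 2 ** 6    # depth of a 6-input LUT used as LUTRAM


def regfile_index(tid, cwp, reg):
    """Map (thread, current window pointer, r0..r31) to a BRAM entry."""
    if not (0 <= tid < MAX_THREADS and 0 <= cwp < NWINDOWS and 0 <= reg < 32):
        raise ValueError("out of range")
    if reg < GLOBALS:
        offset = reg                      # r0..r7 are shared by all windows
    else:
        windowed = (cwp * REGS_PER_WINDOW + (reg - GLOBALS)) % (
            NWINDOWS * REGS_PER_WINDOW)   # wraps so the last window's ins
        offset = GLOBALS + windowed       # land on window 0's outs
    return tid * CHUNK + offset
```

For example, `regfile_index(0, 0, 24)` (first `in` register of window 0) and `regfile_index(0, 1, 8)` (first `out` register of window 1) resolve to the same entry, and every thread's registers stay inside its own 64-entry chunk.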
Mapping SPARC ALU to DSP
Xilinx DSP48E advantage
48-bit add/sub/logic/mux + pattern detector
Easy to generate ALU flags: < 10 LUTs for C, O
Pipelined access over 500 MHz
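A small Python sketch of why condition codes are cheap when a 32-bit add runs on a wider adder: with zero-extended operands the carry is simply bit 32 of the sum, and overflow needs only the three sign bits. This models SPARC-style N/Z/V/C flags for an add; it is an illustration, not the deck's RTL, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def addcc(a, b):
    """32-bit add on a wider adder; returns (result, flags dict)."""
    a &= MASK32
    b &= MASK32
    wide = a + b                      # 33 significant bits, fits a 48-bit adder
    result = wide & MASK32
    sign_a, sign_b, sign_r = a >> 31, b >> 31, result >> 31
    flags = {
        "N": sign_r,
        "Z": int(result == 0),
        # Signed overflow: operands share a sign that the result does not.
        "V": int(sign_a == sign_b and sign_r != sign_a),
        # Carry out is bit 32 of the wide sum.
        "C": (wide >> 32) & 1,
    }
    return result, flags
```

`addcc(0xFFFFFFFF, 1)` wraps to zero with Z and C set, while `addcc(0x7FFFFFFF, 1)` sets N and V but not C.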
DSP advantage
Instruction coverage (two DSPs / pipeline)
1 cycle ALU (1 DSP)
LD/ST (address calculation)
Bit-wise logic (and, or, …)
SETHI (value bypass)
JMPL, RETT, CALL (address calculation)
SAVE/RESTORE (add/sub)
WRPSR, RDPSR, RDWIM (XOR op)
Long latency ALU instructions (1 DSP)
Shift/MUL (2 cycles)
5%~10% logic savings for 32-bit data path
Host Cache/TLB
Accelerating emulation performance!
Per thread cache
Need separate model for target cache
Split I/D direct-mapped write-allocate write-back cache
Block size: 32 bytes (BEE3 DDR2 controller heart beat)
64-thread configuration: 256B I$, 256B D$
Size doubled in 32-thread configuration
Non-blocking cache, 64 outstanding requests (max)
Physical tags, indexed by virtual or physical address
$ size < page size
67% BRAM usage
Per thread TLB
Split I/D direct-mapped TLB: 8-entry ITLB, 8-entry DTLB
Dummy currently (VA = PA)
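The cache sizes above follow from simple arithmetic, sketched here in Python. The total array size is derived from the slide's 64-thread figures (64 x 256 B); how the array is actually partitioned between threads is not stated, so using the thread ID as the upper index bits is an assumption, and all names are hypothetical.

```python
BLOCK_BYTES = 32                       # block size from the slide
TOTAL_BYTES = 64 * 256                 # 64 threads x 256 B per cache
TOTAL_LINES = TOTAL_BYTES // BLOCK_BYTES


def per_thread_geometry(num_threads):
    """Lines and bytes each thread gets when the array is split evenly."""
    lines = TOTAL_LINES // num_threads
    return {"lines": lines, "bytes": lines * BLOCK_BYTES}


def line_index(tid, addr, num_threads):
    """Row of the shared array holding `addr` for thread `tid` (direct-mapped)."""
    lines = TOTAL_LINES // num_threads
    return tid * lines + (addr // BLOCK_BYTES) % lines


def tag(addr, num_threads):
    """Address bits above the per-thread index and block offset."""
    lines = TOTAL_LINES // num_threads
    return addr // (BLOCK_BYTES * lines)
```

With 64 threads each thread gets 8 lines (256 B); with 32 threads it gets 16 lines (512 B), which matches the "size doubled" note. Because a per-thread cache is far smaller than a page, the index bits fall inside the page offset, so virtual and physical indexing select the same line.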
Cache-Memory Architecture
[Figure: cache-memory datapath. The integer pipeline (Memory Stage 1: prepare LD/ST address; Memory Stage 2: load select / routing; Exception/Write Back Stage: load align/sign) issues lookups to the cache, which holds a tag RAM (512 x 36, parity, RAMB18SDP) and a data RAM (512 x 72 x 4, ECC, four RAMB36SDP x72). A cache FSM (hit, exception, etc.) returns a replay signal to the pipeline state control. Memory requests pass through a memory command FIFO to the memory controller, with a 128-bit refill and victim write-back data path and a 64-bit read & modify path to the pipeline.]
Cache controller
Non-blocking pipelined access (3 stages) matches CPU pipeline
Decoupled access/refill: allows pipelined, out-of-order memory accesses
Tells the pipeline to “replay” the instruction on a miss
128-bit refill/write back data path
fill one block in 2 cycles
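A hedged Python sketch of the replay idea: the cache never stalls the pipeline; a miss hands the request to a separate refill side and tells the pipeline to retry later. Dirty victim write-back and the 64-request limit are left out, and the class and method names are hypothetical.

```python
BLOCK_BYTES = 32
REFILL_BITS_PER_CYCLE = 128


def refill_cycles(block_bytes=BLOCK_BYTES, width_bits=REFILL_BITS_PER_CYCLE):
    """Cycles needed to move one block over the refill data path."""
    return (block_bytes * 8) // width_bits


class ReplayCache:
    """Direct-mapped cache that never stalls: a miss returns 'replay'."""

    def __init__(self, lines=8):
        self.lines = lines
        self.tags = [None] * lines
        self.pending = []          # outstanding refills: (index, tag)

    def access(self, addr):
        block = addr // BLOCK_BYTES
        index, tag = block % self.lines, block // self.lines
        if self.tags[index] == tag:
            return "hit"
        if (index, tag) not in self.pending:
            self.pending.append((index, tag))   # hand the miss to the refill side
        return "replay"

    def complete_refill(self):
        """Refill side: install the oldest outstanding block, if any."""
        if self.pending:
            index, tag = self.pending.pop(0)
            self.tags[index] = tag
```

A first `access(0x40)` returns `"replay"`; once `complete_refill()` has run, the replayed access returns `"hit"`. `refill_cycles()` gives 2, in line with the slide's two-cycle block fill.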
Example: A distributed memory non-cache coherent system
Eight multithreaded SPARC v8 pipelines in
two clusters
Memory subsystem
Each thread emulates one independent node
in target system
512 nodes/FPGA
Predicted emulation performance:
~1 GIPS/FPGA (10% I$ miss, 30% D$
miss, 30% LD/ST)
x2 compared to naïve manycore
implementation
Total memory capacity 16 GB, 32MB/node
(512 nodes)
One DDR2 memory controller per cluster
Per FPGA bandwidth: 7.2 GB/s
Memory space is partitioned to emulate
distributed memory system
144-bit wide credit-based memory network
Inter-node communication (under
development)
Two-level tree network to provide all-to-all
communication
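The node count and per-node memory above can be checked with a few lines of Python. The throughput estimate at the end is only one possible back-of-envelope model: it assumes the 150 MHz target clock is met and that every cache miss costs exactly one replayed issue slot. The deck does not state how its ~1 GIPS prediction was derived, so treat that function as an illustration. All names are hypothetical.

```python
PIPELINES = 8
THREADS_PER_PIPELINE = 64
TOTAL_MEMORY = 16 * 2**30          # 16 GB
CLOCK_HZ = 150e6                   # the deck's frequency target, not a measurement

NODES = PIPELINES * THREADS_PER_PIPELINE
BYTES_PER_NODE = TOTAL_MEMORY // NODES


def node_base(node_id):
    """Start address of a node's private partition (assumed contiguous layout)."""
    if not 0 <= node_id < NODES:
        raise ValueError("no such node")
    return node_id * BYTES_PER_NODE


def estimated_ips(i_miss=0.10, d_miss=0.30, ldst=0.30):
    """Rough throughput if every miss costs exactly one replayed issue slot."""
    slots_per_instruction = 1 + i_miss + ldst * d_miss
    return PIPELINES * CLOCK_HZ / slots_per_instruction
```

This gives 512 nodes with 32 MB each, and under the stated assumptions the estimate comes out close to 1 GIPS per FPGA.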
Project Status
Done with RTL implementation
~7,200 lines of synthesizable SystemVerilog code
FPGA resource utilization per pipeline on Xilinx V5 LX110T
~3% logic (LUT), ~10% BRAM
Max 10 pipelines, but back off to 8 or fewer for timing
model
Built RTL verification infrastructure
SPARC v8 certification test suite (donated by SPARC
International) + SystemVerilog
Can be used to run more programs but very slow
(~0.3 KIPS)
RTL Verification Flow in SW
[Figure: verification flow]
Inputs: SPARC V8 verification suite (.S or .C) and a customized linker script (.lds)
Build: GNU SPARC v8 compiler/linker (sparc-linux-gcc, sparc-linux-as, sparc-linux-ld) produces ELF big-endian binaries
Load: an ELF to BRAM translator converts the binaries for simulation
Simulate: Modelsim SE/Questasim 6.4 runs the RAMP Gold SystemVerilog source files / netlist (.sv, .v) with the Xilinx Unisim library
Disassembly: a host SPARC v8 disassembler (C implementation, host binary, using the GNU libbfd library from GNU binutils) attaches through the SystemVerilog DPI interface
Check: a checker examines the simulation log / console output
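A minimal Python sketch of the kind of conversion an ELF to BRAM translator has to perform: turning a big-endian memory image into 32-bit words, one hex word per line, as memory initialization files commonly expect. It assumes the bytes have already been extracted from the ELF file; the deck does not describe the translator's real output format, so the function and format here are hypothetical.

```python
import struct


def words_to_hex_lines(image: bytes):
    """Turn a big-endian memory image into one 32-bit hex word per line."""
    if len(image) % 4:
        image += b"\x00" * (4 - len(image) % 4)   # pad to a whole word
    count = len(image) // 4
    words = struct.unpack(f">{count}I", image)    # '>' = big-endian, as in SPARC
    return [f"{w:08x}" for w in words]


# Two SPARC instruction words: nop (0x01000000) and retl (0x81c3e008).
lines = words_to_hex_lines(bytes([0x01, 0x00, 0x00, 0x00, 0x81, 0xC3, 0xE0, 0x08]))
```

Here `lines` is `["01000000", "81c3e008"]`; keeping the byte order big-endian end to end avoids swapping instruction bytes on a little-endian host.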
Verification in progress
Tested instructions
All SPARC v7 ALU instructions: add/sub, logic, shift
All integer branch instructions
All special instructions: register window, system
registers
Working on: LD/ST and Trap
More verification after P&R and on HW
Work with the rest of the RAMP Gold infrastructure
Lessons so far
Infrastructure is not trivial, and very few sample
designs are available (have to build our own!)
Multithreaded state complicates the verification
process!
Buffers and shared FU interfaces
Thank you