Spring Processor Forum 2005

An Enhanced 32-Bit Processor Core for FPGA Integration
Contents
• MicroBlaze 4.0 and FPU
• Area and Performance Statistics
• Comparison with other processors implemented on FPGA fabric
• Conclusions
Some Virtex FPGA basics
• LUT
• LUT + FF
• Slice, CLB
• Interconnect
• BRAMs: 18 Kbit dual-ported memory
• PPC 405 processors
• High-speed serial I/O
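Building an FPGA: Logic First
• A 4-input lookup table (LUT) can implement any function of 4 inputs.
• For example, a 1-bit adder needs 2 LUTs: one for the sum S and one for the carry out Co.
[Figure: a LUT drawn as a 16 words x 1 bit memory addressed by the 4 inputs [A,B,C,D] with output F; a 1-bit adder with inputs A, B, Ci and outputs S, Co.]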
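As an illustration of the two-LUT adder, here is a minimal Python sketch (not from the slides) that models a 4-input LUT as a 16-entry, 1-bit-wide table and builds a 1-bit full adder from two such tables. The helper names (make_lut4, lut4_read, full_adder) and the choice to leave the fourth LUT input unused are assumptions made for illustration.

```python
def make_lut4(func):
    """Build the 16 x 1 bit LUT contents for a 4-input boolean function."""
    return [
        func(i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1) & 1
        for i in range(16)
    ]


def lut4_read(table, a, b, c, d=0):
    """Look up one bit: the four inputs form the 4-bit address into the table."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]


# LUT 1: sum bit S = A xor B xor Ci (fourth input unused).
SUM_LUT = make_lut4(lambda a, b, ci, _d: a ^ b ^ ci)
# LUT 2: carry out Co = majority(A, B, Ci) (fourth input unused).
CARRY_LUT = make_lut4(lambda a, b, ci, _d: (a & b) | (a & ci) | (b & ci))


def full_adder(a, b, ci):
    """Return (S, Co) for a 1-bit add using the two LUTs above."""
    return lut4_read(SUM_LUT, a, b, ci), lut4_read(CARRY_LUT, a, b, ci)
```

Any other function of up to four inputs is obtained the same way, only by changing the 16 stored bits.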
Add FF to make a Logic Cell
[Figure: the 16 words x 1 bit LUT memory with 4-bit input In, followed by a flip-flop (FF) with Clk, CE, and Rst; elements labeled M select the paths to Out.]
Arithmetic, Distributed RAM
• Make LUT RAM a user resource.
• Fast carry ripple to neighbor.
[Figure: the logic cell extended with Din and WE inputs on the 16 words x 1 bit memory and a carry path from Cin to Cout.]
Add Interconnect
• Group logic cells to reduce overhead.
• Add H, V routing channels with switchboxes.
• Add input, output MUXing between logic and routing.
[Figure: a group of logic cells with 4-bit inputs connected to a routing channel labeled 40.]
Build an Array
[Figure: groups of logic cells with 4-bit inputs tiled into an array between routing channels labeled 40.]
Add Bells & Whistles
[Figure: the FPGA array with added blocks labeled Hard Processor, Gigabit Serial, I/O, Multiplier (18 Bit, 18 Bit, 36 Bit), Programmable Termination, BRAM, and Clock Mgmt.]
CPU Design in FPGA
Instruction Set
• Instruction Set must match the building blocks
– MicroBlaze has 4 logical instructions
• 4-input LUT can process
– 1 bit from each of two operands => 2 bits
– 2 bits to select the operation => 4 different instructions
– 3 logical instructions would take the same area
– 5 logical instructions would double the area
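The LUT budget on this slide can be sketched in Python. This is an illustrative model, not MicroBlaze source: the four logical operations are taken to be and, or, xor, and andn, and the 2-bit opcode encoding and all function names are hypothetical. One 16-entry LUT handles one result bit, so a 32-bit logic unit uses 32 copies of the same table.

```python
# Hypothetical 2-bit opcode encoding for four logical operations.
OPS = {
    0b00: lambda a, b: a & b,        # and
    0b01: lambda a, b: a | b,        # or
    0b10: lambda a, b: a ^ b,        # xor
    0b11: lambda a, b: a & (b ^ 1),  # andn: a AND (NOT b)
}


def build_logic_lut():
    """Fill a 16 x 1 bit LUT: address bits are [a, b, op0, op1]."""
    table = []
    for addr in range(16):
        a = addr & 1
        b = (addr >> 1) & 1
        op = (addr >> 2) & 0b11
        table.append(OPS[op](a, b))
    return table


LOGIC_LUT = build_logic_lut()


def logic_unit(op, x, y, width=32):
    """Apply the selected operation bit by bit, one LUT lookup per bit."""
    result = 0
    for bit in range(width):
        a = (x >> bit) & 1
        b = (y >> bit) & 1
        result |= LOGIC_LUT[a | (b << 1) | (op << 2)] << bit
    return result
```

A fifth operation would need a third select bit, giving five LUT inputs per result bit, which no longer fits a single 4-input LUT.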
FPGA Logic
• FPGA is very different from ASIC/Custom
– ASIC/Custom building blocks are gates
• A 2-input AND costs one 2-input AND gate
• A 4-input XOR costs one 4-input XOR gate and is a little slower
• A 16-bit shift register costs far more than a 2-input AND
– FPGA building blocks are LUTs (CLBs)
• A 2-input AND costs 1 LUT
• A 4-input XOR costs 1 LUT, with the same speed
• A 16-bit shift register also costs 1 LUT
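To illustrate why a 16-bit shift register can cost a single LUT: the LUT's 16 x 1 bit memory can shift its contents by one position per clock, while the four LUT inputs select which stage to read. The following is a behavioral Python sketch under that assumption; the class and method names are hypothetical and this is not a vendor primitive model.

```python
class SRL16:
    """Behavioral model of a 16-deep, 1-bit-wide shift register in one LUT."""

    def __init__(self):
        # The 16 storage bits of the LUT, stage 0 is the most recent input.
        self.bits = [0] * 16

    def clock(self, din, ce=1):
        """On a clock edge with clock enable set, shift in one new bit."""
        if ce:
            self.bits = [din & 1] + self.bits[:15]

    def read(self, addr):
        """The 4-bit address selects which of the 16 stages drives the output."""
        return self.bits[addr & 0xF]
```

Reading with address 15 gives the input delayed by 16 clocks; smaller addresses give shorter delays from the same LUT.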
Datapath
• A multiplexer costs as much as an ALU (1 LUT/bit)
• A deep pipeline creates many muxes due to forwarding
• A pipeline that is too deep will run slower than a shallow one
• 3-4 pipeline stages with full forwarding is optimal
• >4 pipeline stages will run slower and be bigger
Implementation
• Need to use the fastest logic in FPGA
– Carry chain has a delay of 10 ps
– MicroBlaze tries to put anything timing-critical on the carry chain
– e.g. jump detection logic
– Maximize the usage of LUTRAM, SRL16, and all features of the flip-flops
Implementation Characteristics
• FPGAs are really good at networks of ALUs, e.g. put the whole compute graph in there.
• FPGAs are really good at embedded memory if it fits
• Multiplexers are relatively expensive
• Different scaling for ALUs and register files
• Larger processors are not cheap in FPGAs, e.g. PPC 405, a Philips TriMedia 5-issue VLIW.
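Jump detection logic (standard)
[Figure: Reg_Is_Zero is produced by a chain of MUXCY carry multiplexers starting from Vcc, each selected by a LUT4 (INIT=0001) that tests four register bits, from Reg(0)..Reg(3) through Reg(28)..Reg(31); the Jump signal is then generated in additional LUTs. Annotated delay: ~= 2 ns.]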
Jump detection logic (optimal)
[Figure: the same Reg_Is_Zero chain of LUT4s (INIT=0001) and MUXCYs starting from Vcc, with the Jump signal generated by LUT and MultAnd stages placed on the carry chain. Annotated delay: ~= 0.02 ns.]
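The Reg_Is_Zero chain in the figures can be modeled behaviorally. In this Python sketch (an illustration, not MicroBlaze source), each LUT4 with INIT=0001 outputs 1 only when all four of its inputs are 0, and each MUXCY passes the incoming carry when its select is 1 and otherwise outputs its data input. The chain starts from Vcc. The constant-0 data input on every MUXCY and the function names are assumptions.

```python
INIT_ALL_ZERO = 0x0001  # only LUT address 0 (all inputs low) returns 1


def lut4(init, i0, i1, i2, i3):
    """Read one bit from a 16-bit INIT value addressed by the four inputs."""
    addr = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3)
    return (init >> addr) & 1


def muxcy(sel, di, ci):
    """Carry multiplexer: pass the carry in when sel is 1, else the data input."""
    return ci if sel else di


def reg_is_zero(reg):
    """Return 1 if the 32-bit register value is zero, using 8 LUT4 + MUXCY stages."""
    carry = 1  # chain starts at Vcc
    for group in range(8):
        nibble = (reg >> (4 * group)) & 0xF
        bits = [(nibble >> k) & 1 for k in range(4)]
        sel = lut4(INIT_ALL_ZERO, *bits)
        carry = muxcy(sel, 0, carry)  # assumed constant-0 data input
    return carry
```

The carry survives to the end of the chain only if every group of four bits is zero, so the 32-bit comparison needs eight LUTs plus the dedicated carry multiplexers.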
Muxing with flipflops
[Figure: four flip-flops, each with IN, CE, Set, Reset, and OUT pins and a shared Clock, driven by In1..In4 and Sel1..Sel4; their outputs feed a single LUT4.]
Maintenance
• A soft processor is very configurable
– How to optimize the implementation without too many variants?
– Avoid too much low-level implementation and only do it when necessary
– MicroBlaze is a mixture of very detailed implementation and pure RTL code
MicroBlaze v4.00 Block Diagram
[Figure: MicroBlaze v4.00 block diagram. Callouts: Enhanced CPI, Multiplier, FPU, Enhanced Debug, User Configurable Options, Increased Clock Frequency.]
IXCL_M – Instruction side Xilinx Cache Link Master
IXCL_S – Instruction side Xilinx Cache Link Slave
DXCL_M – Data side Xilinx Cache Link Master
DXCL_S – Data side Xilinx Cache Link Slave
MFSL – Master Fast Simplex Link
SFSL – Slave Fast Simplex Link
IOPB – Instruction side On-chip Peripheral Bus
DOPB – Data side On-chip Peripheral Bus
ILMB – Instruction side Local Memory Bus
DLMB – Data side Local Memory Bus
Bus IF – Bus Interface
MicroBlaze v4.00 FPU
• Single-precision floating-point option
– IEEE-754 compatible
• Tightly coupled with CPU
– Up to 120x speedup over software FP emulation
• User-configurable option for MicroBlaze core in award-winning Xilinx Platform Studio tool suite
• Zero cost when not used
– 1,000 LUTs otherwise
Tightly Integrated FPU
• Matched maximum clock frequency
– FPU, MicroBlaze pipeline run at the same frequency
• Low latency
– FP operands use native CPU registers
– FP instructions directly integrated in data flow
• Optimum resource utilization
– Reuse already existing pipeline resources
– FPU adds just 1,000 LUTs to MicroBlaze
=> Best of both worlds:
– Optimized for cost and performance
FPU Highlights
FPGA        Size (MicroBlaze + FPU)   Clock Frequency   Peak Floating-Point Throughput
Virtex-2P   2120 LUTs (FPU + MB)      > 100 MHz         > 16 MFLOPS
Main Hardware Configuration Options
• HW Multiplier
• HW Exception
• FPU
• Cache Link
• I and D Cache
• Debug Logic
• Barrel Shifter
• Pattern Compare
• Divide
BSB Generated System
• Local Instruction Memory: 8KB Block RAM
• Local Data Memory: 8KB Block RAM
• Peripherals: OPB, UART
HW FPU Instructions vs. SW Emulation
FPU Instruction   HW Cycles   Software Function   SW Instructions   SW Cycles   Description
fadd              6           addsf3              450               600         Floating-point arithmetic add
frsub             6           subsf3              450               600         Reverse floating-point arithmetic subtraction
fmul              6           mulsf3              1200              1600        Floating-point arithmetic multiplication
fdiv              30          divsf3              600               750         Floating-point arithmetic division
fcmp.lt           3           ltsf2               350               450         Less-than floating-point comparison
fcmp.eq           3           eqsf2               350               450         Equal floating-point comparison
fcmp.le           3           lesf2               350               450         Less-or-equal floating-point comparison
fcmp.gt           3           gtsf2               350               450         Greater-than floating-point comparison
fcmp.ne           3           nesf2               350               450         Not-equal floating-point comparison
fcmp.ge           3           gesf2               350               450         Greater-or-equal floating-point comparison
Cycle time reduced using MicroBlaze FPU
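The cycle counts on this slide give a per-operation speedup directly, and the 6-cycle add/multiply latency at a 100 MHz clock appears consistent with the "> 16 MFLOPS" peak figure quoted earlier. The Python sketch below works through that arithmetic. The function names are hypothetical, and treating peak throughput as one operation per 6 cycles is an assumption, not a statement from the slides.

```python
# (hardware cycles, software emulation cycles) taken from the slide's table.
FPU_CYCLES = {
    "fadd": (6, 600),
    "frsub": (6, 600),
    "fmul": (6, 1600),
    "fdiv": (30, 750),
    "fcmp.lt": (3, 450),
}


def speedup(instruction):
    """Per-operation speedup: software cycles divided by FPU cycles."""
    hw_cycles, sw_cycles = FPU_CYCLES[instruction]
    return sw_cycles / hw_cycles


def peak_mflops(clock_mhz, cycles_per_op):
    """Assumed peak rate: one floating-point operation every cycles_per_op cycles."""
    return clock_mhz / cycles_per_op


if __name__ == "__main__":
    for name in FPU_CYCLES:
        print(f"{name}: {speedup(name):.1f}x")
    print(f"peak at 100 MHz: {peak_mflops(100, 6):.2f} MFLOPS")
```

Whole-program speedups also depend on the share of non-floating-point instructions, so they will generally differ from these per-operation ratios.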
FPU Performance vs. Software
[Figure: bar chart of the factor of speedup over software FP (axis 0 to 120) for the JPEG, FFT, and FIR benchmarks; bars are labeled 120X, 50X, and 6X.]
Complete FP Support
• Loads/stores use standard MicroBlaze instructions
• Infinity, signed zeros follow IEEE-754 conventions
• Software libraries for additional FPU operations
– Rounding
– Square root
– Conversions
– Other floating-point library functions
=> FPU operations seamlessly supported by standard programming model
MicroBlaze v4.00 Debug Logic
• New debug logic
– Inserts instructions into the pipeline
– Access to anything that instructions can access
– Packetized data-transfer protocol
• Immediate value to users:
– Less debug logic: reduced by 50%
– Faster download: up to 15x faster
– Access to all registers, including ESR, EAR, and FSR
MicroBlaze v4.00 Pipeline
[Figure: pipeline diagram with stages IF, OF, and EX. Blocks shown include the prefetch buffer, I-Cache, I-LMB, PC +4 logic, register file read (Ra, Rb, Imm), ALU, Shift, Barrel Shift, MUL, DIV, FPU, FSL get, FSL put, jump logic, the MSR, EAR, ESR, and FSR registers, D-Cache (BRAM), D-LMB, the OPB and XCL interfaces, register file write, and the int_vec, brk_vec, and expt_vec vectors.]
Up to 16 FSL channels
Performance Improvements: Bringing It All Together
• Now supporting GCC 3.4.1 unit-at-a-time compile
– Moved up from GCC 2.9 function-at-a-time compile
• New hardware and compiler boost performance
– 0.79 DMIPS/MHz to 0.92 DMIPS/MHz
=> Overall performance benefits to users:
– 16% performance improvement on integer code
– Up to 40% faster string searches
– Up to 120x performance improvement on FP code
Configured for Performance
FPGA                      Size         Clock Frequency   Dhrystone 2.1   Performance
Virtex-4 (4VLX25-12)      1,269 LUTs   180 MHz           166 DMIPS       0.92 DMIPS/MHz
Virtex-II Pro (2VP20-7)   1,225 LUTs   150 MHz           138 DMIPS       0.92 DMIPS/MHz
Spartan-3 (3S1500-5)      1,318 LUTs   100 MHz           92 DMIPS        0.92 DMIPS/MHz
Main Hardware Configuration Options
• Barrel Shifter
• Pattern Compare
• Divide
• HW Multiplier
• HW Exception
• FPU
• Cache Link
• I and D Cache
• Debug Logic
Performance Optimized Subsystem
• Local Instruction Memory: 8KB Block RAM
• Local Data Memory: 16KB Block RAM
• Peripherals: GPIO, Timer
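The Dhrystone figures on this slide follow from the clock frequency multiplied by 0.92 DMIPS/MHz, and the step from 0.79 to 0.92 DMIPS/MHz corresponds to the quoted 16% integer improvement. A short Python check of that arithmetic (function and constant names are hypothetical):

```python
DMIPS_PER_MHZ_OLD = 0.79  # before the new hardware and compiler
DMIPS_PER_MHZ_NEW = 0.92  # MicroBlaze v4.00 figure from the slides

# Clock frequencies of the performance-optimized configurations, in MHz.
CLOCKS_MHZ = {"Virtex-4": 180, "Virtex-II Pro": 150, "Spartan-3": 100}


def dmips(clock_mhz, dmips_per_mhz=DMIPS_PER_MHZ_NEW):
    """Dhrystone MIPS at a given clock frequency."""
    return clock_mhz * dmips_per_mhz


def improvement_percent(old, new):
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    for fpga, mhz in CLOCKS_MHZ.items():
        print(f"{fpga}: {dmips(mhz):.1f} DMIPS at {mhz} MHz")
    gain = improvement_percent(DMIPS_PER_MHZ_OLD, DMIPS_PER_MHZ_NEW)
    print(f"DMIPS/MHz improvement: {gain:.1f}%")
```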
Configured for Frequency
FPGA                      Size       Clock Frequency
Virtex-4 (4VLX40-12)      988 LUTs   205 MHz
Virtex-II Pro (2VP20-7)   827 LUTs   170 MHz
Spartan-3 (3S1500-5)      983 LUTs   105 MHz
Main Hardware Configuration Options
• Barrel Shifter
• Pattern Compare
• Divide
• HW Multiplier
• HW Exception
• FPU
• Cache Link
• I and D Cache
• Debug Logic
Frequency Optimized Subsystem
• Local Instruction Memory: 8KB Block RAM
• Local Data Memory: 8KB Block RAM
• Peripherals: GPIO
ucLinux System
FPGA        Size (MicroBlaze + FPU)    Clock Frequency   Peak Floating-Point Throughput
Virtex-2P   ~4000 LUTs (FPU + MB)      > 100 MHz         > 16 MFLOPS
Main Hardware Configuration Options
• Barrel Shifter
• Pattern Compare
• Divide
• HW Multiplier
• HW Exception
• FPU
• Cache Link
• I and D Cache
• Debug Logic
System Configuration
• I Cache: 8KB RAM
• D Cache: 8KB Block RAM
• Peripherals: OPB, UART, MCH Memcon (~1300), EthernetLite (~500)
Tensilica core example
http://www.xilinx.com/products/logicore/alliance/tensilica/tensilica_xtensa.pdf
Soft Processor Comparison
[Figure: bar chart comparing # LUTs (axis 0 to 40000) and Freq (MHz) (axis 0 to 120) for each processor.]
             MB+FPU   Soft PPC*   Tensilica*   Tensilica* (L)
LUTs         2120     33840       12332        29622
Freq (MHz)   100      25          33           29
• MicroBlaze 4.0 with FPU implemented on V2P FPGA
– V2P adds hard PowerPC 405 block to Virtex2 FPGA fabric
• Soft version of PowerPC 405 implemented on Virtex2 FPGA fabric for emulation
– No FPU
• Tensilica Xtensa V core implemented on Virtex2 FPGA fabric by Tensilica (small and large variants)
– No FPU
• Each V2P70 FPGA on BEE2 Board contains 74448 LUTs
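As a rough reading of these numbers, the Python sketch below estimates how many copies of each core would fit in the 74448 LUTs of one V2P70 by LUT count alone. It ignores routing, block RAM, peripherals, and achievable utilization, so the results are upper bounds under that assumption. The dictionary layout and function names are hypothetical.

```python
V2P70_LUTS = 74448  # LUTs per V2P70 FPGA on the BEE2 board

# LUT counts and clock frequencies from the comparison table.
CORES = {
    "MB+FPU": {"luts": 2120, "freq_mhz": 100},
    "Soft PPC": {"luts": 33840, "freq_mhz": 25},
    "Tensilica": {"luts": 12332, "freq_mhz": 33},
    "Tensilica (L)": {"luts": 29622, "freq_mhz": 29},
}


def max_instances(core, budget=V2P70_LUTS):
    """Upper bound on instances per FPGA, counting LUTs only."""
    return budget // CORES[core]["luts"]


def lut_ratio(core, baseline="MB+FPU"):
    """How many times larger a core is than the baseline, in LUTs."""
    return CORES[core]["luts"] / CORES[baseline]["luts"]


if __name__ == "__main__":
    for name in CORES:
        print(f"{name}: up to {max_instances(name)} per V2P70, "
              f"{lut_ratio(name):.1f}x the LUTs of MB+FPU")
```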
RAMP Interests
• Type B Code
– Parameterized HDL code (address space, datapath, opcodes, etc)
– Xilinx highly interested in outside contributors
• 64 bit address space
– 16, 32, 64, 128 … - HDL code scales
– GCC support of 64 bit address also available
• DP FP
– Xilinx has the operators (DSP library) – SP, DP, in between
– Need to integrate into MB
• Cache Coherency
– Memory controllers are key
– MMU / coherency manager larger than CPU itself
Xilinx Interests
• Programming models
– #1 obstacle for wider adoption of FPGAs
– Cache coherency, partitioning, MP
– RTOS, transactions, MPI, etc.
– Niagara, Cell, etc.
– Building large(r) chips is easy; filling them is more challenging
• Help with (soft) architecture development
– Memory and communication models – tough
– Processor design / architecture – easy, well studied, well published
• Contributing and receiving back
– We make money selling chips (FPGAs)
– MB made possible by (single) processor work of past 20+ years
– MB designed with Type B model in mind
– Licensing model to support this
• Need help with programming and architecture of MP systems
– Have ideas
– Are willing to help
Xilinx University Program (XUP)
• Xilinx interface to both University Teaching and Research
• Academic donation program
– All Xilinx Development Tools and IP cores available for donation
– Hardware donations of boards and/or silicon
– Large Bee2/RAMP participant
• Technical XUP Contact: [email protected]
• http://www.xilinx.com/univ/
XUPV2P Development Board
• Curriculum-on-a-Chip™ targets more than basic Logic Design => Embedded Processing and DSP
• Academic cost: $299
• ~2000 in Universities in about a year
• Model: Develop/Debug design on XUPV2P, then scale up to Bee2
• http://www.xilinx.com/univ/xupv2p.html