SoC Verification through FPGA’s

1
System Functionality Verification using FPGA
경종민
[email protected]
2
Contents
• Section I
– Introduction to reconfigurable computing
– FPGA Logic/Routing architecture
• Section II
– Core-embedded FPGA
– ALTERA/XILINX/TRISCEND/SiDSA
• Section III
– Multiple-FPGA architecture
– Emulation/Simulation acceleration using FPGA’s
3
Introduction
• Design execution methodology
– Hardware
• Very fast & efficient
• No alteration after fabrication
• Redesign and refabrication are expensive
– Software-programmed processors
• Set of instructions determines a specific operation.
• Functionality can be easily changed.
• Performance is far below that of an ASIC.
4
Reconfigurable Computing
• Fill the gap between hardware and software
– FPGA is an array of computational elements and the
routing wires among them.
– The configuration is determined by programmable
configuration bits.
• Development
– 1963 : Concept of “restructurable computing” appeared.
– 1980’s : FPGA technology developed as a hybrid between PALs and MPGAs (Mask Programmable Gate Arrays) by Xilinx, Altera, Lucent, QuickLogic, and others.
– SRAM-programmable FPGA : high density
– 1999-Now : Core-embedded FPGAs incorporate both a programmable processor and FPGA logic.
5
Logic Block
• LUT-based logic block
– Efficient logic block architecture adopted in many
commercial FPGA’s
– Composed of LUT, DFF(Latch), and mux
[Figure: LUT-based logic block with inputs I1 to I4 feeding a 4-LUT, carry logic (Cin, Cout), a DFF, and the output Out]
6
Logic Block
• 4-LUT
– Any function with 4 input variables can be implemented.
• FF
– Used for pipelining and registers
– Can be configured as a latch
– Clock signals come from global signals routed on special resources (global nets)
• Carry logic
– Speeds up carry-based arithmetic functions
– Bypasses the general routing resources and connects directly to the neighboring CLB
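
The logic block described above can be sketched in a few lines of Python. This is a minimal behavioral model, not any vendor's actual cell: the function names are hypothetical, and the carry arrangement (the LUT computes a propagate signal and a mux passes the carry along) is one common scheme assumed here for illustration.

```python
from itertools import product


def make_lut4(func):
    """Build the 16 configuration bits (truth table) for a 4-input function."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=4)]


def lut4_eval(table, i1, i2, i3, i4):
    """The four inputs form an address into the 16-entry table."""
    address = (i1 << 3) | (i2 << 2) | (i3 << 1) | i4
    return table[address]


# LUT configured as a 2-input XOR (inputs I3 and I4 are ignored).
XOR2 = make_lut4(lambda a, b, c, d: a ^ b)


def add_with_carry_chain(x, y, width=4):
    """Ripple-carry adder: one LUT plus carry logic per bit (assumed scheme)."""
    carry = 0
    total = 0
    for bit in range(width):
        a = (x >> bit) & 1
        b = (y >> bit) & 1
        propagate = lut4_eval(XOR2, a, b, 0, 0)
        total |= (propagate ^ carry) << bit
        # Carry mux: pass the incoming carry if a != b, otherwise generate a.
        carry = carry if propagate else a
    return total, carry


parity = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
```

Any other 4-input function is obtained by changing only the table contents, which is what the configuration bits do in the real device.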
7
Interconnection Architecture
• Island-style FPGA routing architecture
– The routing architecture used by most FPGAs
– Sea of routing resources for connection between rows and columns of logic blocks
– Connection blocks : Programmable multiplexers that select which signals in the given routing channel are connected to the logic block’s terminals.
– Switch Box: Connections between horizontal and vertical
routing resources
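
As a rough illustration of the two routing elements above, the sketch below models a connection block as a multiplexer and a switch box as a configurable mapping between horizontal and vertical tracks. The function names, the 4-track channel, and the dictionary-based configuration are assumptions made for this example only.

```python
def connection_block(channel_tracks, select):
    """Programmable multiplexer: `select` (the configuration) picks the
    channel track that drives one logic block terminal."""
    return channel_tracks[select]


def switch_box(horizontal_tracks, config, n_vertical):
    """Connect horizontal tracks to vertical tracks.

    `config` maps a vertical track index to the horizontal track index that
    drives it; vertical tracks with no entry are left undriven (None).
    """
    return [
        horizontal_tracks[config[v]] if v in config else None
        for v in range(n_vertical)
    ]


# Hypothetical 4-track channel carrying these logic values.
channel = [0, 1, 1, 0]
pin_value = connection_block(channel, select=2)
vertical = switch_box(channel, {0: 1, 3: 2}, n_vertical=4)
```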
8
Interconnection Architecture
• island-style routing architecture
9
Interconnection Architecture
• Routing resources with various lengths
– Local interconnections : Routing between adjacent logic blocks (e.g. dedicated carry chain)
– Medium-length lines : Routing wires that span the width of several logic blocks
– Long lines : Routing wires that run the whole chip height or width
– Global lines : Routing wires that reach the entire area of the chip
• High-speed, low-skew, connections to all logic blocks
• Usually used for clocks, resets.
10
Two Routing Architectures
• Segmented routing architecture
– Local communication traffic by short wires
– Long wires are frequently used to travel long distances
without passing through many switches
– Research questions
• How many wires should each channel contain?
• How many types of long wires would be efficient?
• What proportion of the routing resources should each wire type take?
– Companies : Xilinx, Lucent, Vantis
11
Two Routing Architectures
• Hierarchical routing architecture
– Cluster-based routing architecture
• Routing within a cluster is at the local level, only connecting
within that cluster.
• Longer wires connect different clusters together.
– Each routing level contains several clusters
– Background
• Most connections between logic blocks are local with only
a limited amount of communication traversing long distance
– Good placement algorithm is required.
– Company : ALTERA
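
To see why placement matters in a hierarchical architecture, the sketch below computes the lowest routing level that can connect two logic blocks. It assumes blocks are numbered consecutively and that every level groups a fixed number of lower-level clusters (4 here); both are simplifications chosen for the example, not properties of any specific device.

```python
def routing_level(block_a, block_b, cluster_size=4):
    """Return the lowest hierarchy level whose cluster contains both blocks.

    Level 0 means the same block, level 1 the same local cluster, and each
    higher level is a cluster of `cluster_size` lower-level clusters.
    """
    level = 0
    while block_a != block_b:
        block_a //= cluster_size
        block_b //= cluster_size
        level += 1
    return level


# Blocks 0 and 3 share a local cluster; blocks 0 and 4 need the next level up.
local = routing_level(0, 3)
one_up = routing_level(0, 4)
two_up = routing_level(0, 16)
```

A placer that keeps communicating blocks inside the same cluster keeps most connections at level 1, which matches the locality observation above.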
12
Two Routing Architectures
[Figure: segmented routing and hierarchical routing side by side, showing logic blocks, connection switches, and clusters]
13
Heterogeneous architecture
• Multiplier embedding
– Multiplier implementation in FPGA is usually inefficient.
– Custom/Configurable hardware for multiplication with
various operand widths and choice of signed/unsigned
can be embedded using a reconfigurable array of FAB’s
(special full adder blocks).
– (Haynes, Field-Programmable Custom Computing Machines,
1998)
14
15
Heterogeneous architecture
• Embedded memory blocks
– Use of available LUTs as RAM structure (Xilinx XC4000,
Virtex FPGAs)
– Dedicated memory blocks within array (Xilinx Virtex and
Altera FPGAs)
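
The first option, using LUTs as RAM, works because a 4-input LUT is already a 16-entry memory; making its contents writable at run time turns it into a 16x1 RAM. The sketch below gangs several such LUTs to form a 16-word memory. The class name and the 8-bit default width are illustrative assumptions.

```python
class LutRam:
    """A group of 16-entry LUTs used as a 16 x width RAM (one LUT per bit)."""

    def __init__(self, width=8):
        self.width = width
        # One 16-entry table per data bit, all cleared at start.
        self.luts = [[0] * 16 for _ in range(width)]

    def write(self, address, value):
        """Store one bit of `value` in each LUT at the same 4-bit address."""
        for bit in range(self.width):
            self.luts[bit][address & 0xF] = (value >> bit) & 1

    def read(self, address):
        """Reassemble the word from the addressed entry of every LUT."""
        return sum(
            self.luts[bit][address & 0xF] << bit for bit in range(self.width)
        )


ram = LutRam(width=8)
ram.write(3, 0xA5)
```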
16
Xilinx Virtex architecture
Block SelectRAM is embedded inside logic blocks as a column.
17
Heterogeneous Architecture
• Processor embedding
– In late 2000, several commercial FPGA companies announced plans to embed entire microprocessors.
– Altera
• ARM9-based Excalibur device
– Xilinx
• PowerPC based Virtex-II device
– Triscend
• 8051/ARM based SoC integration platform
18
SoC Verification through
FPGA’s
Core-Embedded FPGA
19
Core-Embedded FPGA’s
• ALTERA
– Excalibur™
• ARM-embedded FPGA
– Stratix™
• Currently without ARM core. Excalibur’s next version is under development.
• XILINX
– Virtex-II Pro™
• IBM’s PowerPC-embedded FPGA.
• Triscend
– A7
• ARM-embedded FPGA
– E5
• 8051-embedded FPGA
20
ALTERA’s Excalibur
• ARM9 core integrated with FPGA
– “SOPC (System On Programmable Chip)”
– C/C++ compiler/debugger integrated in the FPGA
compiler.
• Interface between processor and FPGA
– AMBA (Advanced Microcontroller Bus Architecture)
– The widely used internal bus architecture for SoC.
– The connection between ARM processor and FPGA block
is done by AMBA bus.
21
ALTERA’s Excalibur
[Figure: Excalibur block diagram with three clock domains: Clock Domain 1 (AHB1, up to 180 MHz), Clock Domain 2 (AHB2, up to 90 MHz), Clock Domain 3 (PLD, up to 100 MHz)]
23
ALTERA’s Excalibur
• AHB1
– Bridge for AHB2
– Interrupt controller, watchdog timer
– Single-port and dual-port SRAM
– The embedded processor is the only bus master on AHB1
24
ALTERA’s Excalibur
• AHB2
– PLD transfers data with memories, UART, or PLD slave
– Dedicated interfaces between stripe (processor and peripherals) and PLD
26
XILINX’s Virtex-II Pro
• PowerPC core integrated with FPGA
– “Platform FPGA architecture”
– Up to four PPC cores can be integrated.
• Interface between processor and FPGA
– CoreConnect Bus
• PLB (Processor Local Bus)
• DCR (Device Control Register) bus
– OCM(On-Chip Memory) interface
• Dedicated interface between the block RAM and OCM
signals of PPC core.
27
Virtex-II Pro Block Diagram
[Figure: Virtex-II Pro block diagram with callouts: PowerPC cores (two in this diagram), block RAM and multiplier blocks, configurable logic block array]
28
PPC Core Block
[Figure: PPC 405 core with control logic, two OCM controllers, the DCR bus, and attached block RAMs]
• The OCM controller is a dedicated interface between the PPC and block RAM.
• Block RAM can be configured as Instruction-Side Block RAM (ISBRAM) or Data-Side Block RAM (DSBRAM).
• Fixed latency of memory access guarantees higher-speed execution.
• Block RAM can be configured as dual-port RAM (data communication between PPC and FPGA).
• PLB master interface ports are at the boundary of the PPC core.
29
Triscend’s E5/A7
• E5/A7
– “CSoC(Configurable System-on-Chip)”
– E5 contains 8051 core, CSL(Configurable System Logic)
matrix, and peripheral interfaces(JTAG, DMA, Timer,
FIFO)
– A7 contains ARM core instead of 8051.
• CSI (Configurable System Interconnect)
– Bus developed by Triscend.
– Pipelined bus architecture for the performance
optimization
30
Triscend E5/A7
• The bus architecture allows the bus to be extended throughout the whole chip while preserving high performance.
– The internal system bus is extended throughout the
user-configurable system logic.
• Objectives
– Any processor can be included.
– High performance is assured regardless of the CSL size
31
Triscend’s A7 Architecture
• CSI Bus
– Configurable System Interconnect
– Masters of CSI
• ARM
• JTAG (configuration)
• DMA0, DMA1, DMA2, DMA3
– Sideband signals
• A small number of dedicated signals for UART and timer
32
Triscend’s CSL matrix
• Vertical/Horizontal Breakers
– Vertical : Address decoder part of CSI
– Horizontal : Data read/write port of CSI
• Selector
– Decodes address
– Registers are arranged in a vertical column of CSL cells
– Pre-programmed at initialization
33
Triscend’s System Architecture
[Figure: Triscend system architecture with CPU, CSL, DMA, JTAG, memory interface, RAM, ROM, and a bus FIFO/arbiter for multiple masters]
• A bus master requires grant signals from the arbiter.
• The CPU runs boot code initially. The boot code configures the CSL and stores program/data.
34
CSI Bus Architecture
[Figure: CSI bus architecture. Blocks: masters, arbiters, bus FIFO, selectors and pipe registers, dedicated slaves, CSL. Buses: Master Write (address/data/control), Slave Write (address/data/control), Master Read (data/control), Slave Read (data/control)]
35
Pipelined Write Transaction
[Figure: pipelined write transaction on the CSI bus architecture diagram, annotated with time slots T, T+1, and T+2]
36
Pipelined Read Transaction
[Figure: pipelined read transaction on the CSI bus architecture diagram, annotated with time slots T, T+1, T+2, and T+3]
37
Pipeline in view of Bus Logic
[Figure: pipeline in view of bus logic across time slots T, T+1, T+2, and T+3. Labels: Bus FIFO/arbiter/master, Address/Data, Configure, Selector Decode, Read from CSL, Data from CSL to Master]
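
The benefit of pipelining the bus can be estimated with simple slot counting. The sketch below takes the slot counts from the diagrams (a read spans T to T+3, a write spans T to T+2) and assumes that a new transaction can enter the pipeline in every time slot with no wait states; that issue rate and the function names are assumptions for illustration.

```python
READ_STAGES = 4   # time slots T .. T+3 in the read diagram
WRITE_STAGES = 3  # time slots T .. T+2 in the write diagram


def pipelined_slots(n_transactions, stages):
    """Slots needed when one transaction enters the pipeline per slot."""
    if n_transactions == 0:
        return 0
    return stages + (n_transactions - 1)


def unpipelined_slots(n_transactions, stages):
    """Slots needed when each transaction must finish before the next starts."""
    return n_transactions * stages


def schedule(n_transactions, stages):
    """For each transaction, list the time slots it occupies (T = 0)."""
    return [list(range(start, start + stages)) for start in range(n_transactions)]


reads_pipelined = pipelined_slots(8, READ_STAGES)
reads_serial = unpipelined_slots(8, READ_STAGES)
```

Under these assumptions, eight back-to-back reads take 11 slots instead of 32, so throughput approaches one transaction per slot even though each read still has a four-slot latency.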
38
Wait State
• Why is it generated?
– 1. The handshake operation inside the logic implemented
in CSL.
– 2. CSL logic is too slow to respond in one cycle.
• Sequence of generation
– 1. “Address Selector” in CSL generates wait state if the
system tries to access the Selector’s address.
– 2. If more than one wait state is required, the CSL
function inserts additional wait states.
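
The two-step generation sequence can be written as a small function. In this sketch the address values, the set of selector addresses configured to stall, and the per-address extra wait counts are all hypothetical; the base of four slots is taken from the read transaction spanning T to T+3.

```python
def wait_states(address, selector_wait_addresses, extra_waits_by_address):
    """Count the wait states for one access.

    Step 1: the address selector asserts one wait state if the access hits
            an address it is configured to stall.
    Step 2: the CSL function may insert further wait states on top of that.
    """
    if address not in selector_wait_addresses:
        return 0
    return 1 + extra_waits_by_address.get(address, 0)


def read_slots(address, selector_wait_addresses, extra_waits_by_address,
               base_slots=4):
    """Total time slots for one read, including any wait states."""
    return base_slots + wait_states(
        address, selector_wait_addresses, extra_waits_by_address
    )


# Hypothetical configuration: two slow addresses, one needing two extra waits.
SLOW = {0x10, 0x20}
EXTRA = {0x20: 2}
```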
39
Wait State Insertion
[Figure: pipeline timing diagram across time slots T, T+1, T+2, and T+3 (Bus FIFO/arbiter/master, Address/Data, Configure, Selector Decode, Read from CSL, Data from CSL to Master), with an added Waitnow signal and an OR gate]
40
CSL Physical Structure
[Figure: CSL physical structure. 16 repeated blocks, each carrying the labels "Wait Dist.", "BankCell", and "Logic", together with logic tiles, 16x8 RAMs, system logic, and an 8K RAM]
• Bus pipeline registers sit at each bank boundary, so the time slots for user logic are independent of the signal transport time between banks.
• The write/read bus is distributed throughout the CSL and is buffered and piped into each bank (red arrows in the figure).
• The wait signals generated by each bank are propagated to the pipeline registers in all other banks.
41
Structure Bank/Bus/Selector
[Figure: one bank, with 8 selectors and a grid of 64 tiles]
• Selectors are configured initially for column selection and wait generation.
• 4 wires per tile
• The horizontal data line writes data to the CSL cell.
• The read data is OR’ed onto the horizontal read data line.
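
An OR-based read line works because every cell that is not selected drives zero, so the OR of all drivers equals the selected cell's value. The sketch below is a minimal model of that idea; the function name and the example cell values are assumptions, and it also shows what happens if two cells are selected at once.

```python
from functools import reduce
from operator import or_


def read_bus(cells, selected):
    """Return the value seen on an OR'ed read data line.

    `cells` holds each cell's stored value and `selected` is the set of cell
    indices currently enabled. Unselected cells contribute 0 to the OR.
    """
    drivers = [
        value if index in selected else 0 for index, value in enumerate(cells)
    ]
    return reduce(or_, drivers, 0)


cells = [0x12, 0x34, 0x56]
single = read_bus(cells, {1})      # only cell 1 drives the line
clash = read_bus(cells, {0, 1})    # two drivers: values are OR'ed together
```

The clash case shows why the selectors must decode to a single cell per address: two enabled cells produce a merged value rather than an error.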
42
E5 Physical Implementation
• 8051 CPU core
• 0.35 µm process, 40 MHz CSL operation
[Figure: E5 physical implementation, with regions labeled "8051 CPU core and RAM/ROM" and "CSL matrix"]
43
SiDSA’s FIPSOC
• Integration of CAB (Configurable Analog Block)
– 8051 microcontroller
– FPGA
– Configurable analog cells optimized for data acquisition
applications
• Dynamic reconfiguration
– Two configuration bits for each CLB
– User can download extra configuration data while the
cells are in operation.
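
Dynamic reconfiguration of this kind can be pictured as a cell that holds two configurations, one active and one spare, so that new configuration data is loaded into the spare while the active one keeps running. The sketch below reads "two configuration bits" as two stored configurations per cell, which is an interpretation; the class and method names are hypothetical.

```python
class DualContextCell:
    """A logic cell with an active configuration and a spare one."""

    def __init__(self, active_config):
        self.contexts = [active_config, None]
        self.active = 0

    def load_spare(self, config):
        """Download a new configuration without disturbing the active one."""
        self.contexts[1 - self.active] = config

    def swap(self):
        """Make the spare configuration the active one."""
        if self.contexts[1 - self.active] is None:
            raise ValueError("no spare configuration loaded")
        self.active = 1 - self.active

    def evaluate(self, *inputs):
        """Compute the output using the currently active configuration."""
        return self.contexts[self.active](*inputs)


cell = DualContextCell(lambda a, b: a & b)   # starts as an AND gate
before = cell.evaluate(1, 0)
cell.load_spare(lambda a, b: a | b)          # load an OR gate in the background
during = cell.evaluate(1, 0)                 # still behaves as AND
cell.swap()
after = cell.evaluate(1, 0)                  # now behaves as OR
```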
44
Analog Subsystem
• Configurable Analog Blocks (CAB)
– Differential amplification
– Comparison
– Data conversion (ADC, DAC)
• Digital part
– The digital logic that configures the CABs is controlled by the microprocessor (µP) or the programmable logic.
45
Comparison
• Xilinx
– Using CoreConnect bus to connect processor and FPGA.
– Multiple processor cores can be used simultaneously.
• ALTERA
– AMBA bus to connect processor and FPGA.
• Triscend
– Processor can read/write any register inside of CSL
matrix. (CSL matrix can be considered as a functional
block of the processor)
– Heavy pipelining is adopted to maintain or increase throughput, because the latency of a bus distributed throughout the CSL matrix could otherwise be excessive.