Pamięci - AGH University of Science and Technology

Programmable Logic Devices
Ernest Jamro
Dept. of Electronics
AGH UST, Kraków Poland
PLD as a Black Box
[Figure: inputs (logic variables) → logic gates and programmable switches → outputs (logic functions)]
Programmable Logic Array (PLA)
– The connections in the AND plane are programmable
– The connections in the OR plane are programmable
[Figure: inputs x1, x2, …, xn → input buffers and inverters (true and complemented inputs) → AND plane (product terms P1 … Pk) → OR plane → outputs f1 … fm]
Gate Level Version of PLA
f1 = x1x2 + x1x3' + x1'x2'x3
f2 = x1x2 + x1'x2'x3 + x1x3
[Figure: inputs x1, x2, x3 reach the AND plane (product terms P1 to P4) through programmable connections; the OR plane forms f1 and f2]
Customary Schematic of a PLA
f1 = x1x2 + x1x3' + x1'x2'x3
f2 = x1x2 + x1'x2'x3 + x1x3
[Figure: inputs x1, x2, x3; AND plane with product terms P1 to P4; OR plane with outputs f1 and f2. An x marks the connections left in place after programming]
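The two-plane structure above can be sketched in a few lines of Python. This is only an illustrative model, not a description of any real device: the dictionaries `AND_PLANE` and `OR_PLANE` and the function `pla` are hypothetical names, and the assignment of the four product terms to P1 to P4 is an assumption, since the slide figure is not reproduced here.

```python
# AND plane: each product term lists the literals left connected.
# Key = input index (0 for x1, 1 for x2, 2 for x3);
# value 1 = true input, value 0 = complemented input.
# Assumption: the mapping of terms to P1..P4 is illustrative.
AND_PLANE = {
    "P1": {0: 1, 1: 1},        # x1 x2
    "P2": {0: 1, 2: 0},        # x1 x3'
    "P3": {0: 0, 1: 0, 2: 1},  # x1' x2' x3
    "P4": {0: 1, 2: 1},        # x1 x3
}

# OR plane: each output lists the product terms left connected.
OR_PLANE = {
    "f1": ["P1", "P2", "P3"],
    "f2": ["P1", "P3", "P4"],
}


def pla(x, and_plane=AND_PLANE, or_plane=OR_PLANE):
    """Evaluate the PLA for the input tuple x = (x1, x2, x3)."""
    terms = {
        name: all(x[i] == literal for i, literal in connections.items())
        for name, connections in and_plane.items()
    }
    return {f: int(any(terms[p] for p in used)) for f, used in or_plane.items()}


print(pla((1, 0, 0)))  # x1 x3' is true, so f1 = 1 and f2 = 0
```

Reprogramming the device corresponds to editing the two dictionaries; in a PAL only `AND_PLANE` would be editable.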
AND Plane Implementation with Floating Gate Transistors
[Figure: transistor array with pull-ups to VDD; inputs In0, In0', …, InN-1'; outputs Out0, Out1, …, OutM-1]
Programmable Array Logic (PAL)
– The connections in the AND plane are programmable
– The connections in the OR plane are NOT programmable
[Figure: inputs x1, x2, …, xn → input buffers and inverters (true and complemented inputs) → AND plane (product terms P1 … Pk) → OR plane with fixed connections → outputs f1 … fm]
Example Schematic of a PAL
f1 = x1x2x3' + x1'x2x3
f2 = x1'x2' + x1x2x3
[Figure: inputs x1, x2, x3; AND plane with product terms P1 to P4; fixed OR gates produce f1 and f2]
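Macrocell
[Figure: the OR gate from the PAL drives a D flip-flop (clocked by Clock) and input 0 of a 2-to-1 multiplexer controlled by Select; the flip-flop output Q drives input 1. The multiplexer output goes back to the AND plane and, through a tri-state buffer controlled by Enable, to the output pin f1]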
Macrocell Functions
– Enable = 0 can be used to allow the output pin for f1 to be used as an additional input pin to the PAL
– Enable = 1, Select = 0 is normal for typical PAL operation
– Enable = Select = 1 allows the PAL to synchronize the output changes with a clock pulse
– The feedback to the AND plane provides for multilevel design
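The three operating modes listed above can be captured in a small behavioral sketch. The class name `Macrocell` and its methods are hypothetical, and the model assumes that the feedback to the AND plane is taken from the multiplexer output; the slide text does not state this explicitly.

```python
class Macrocell:
    """Behavioral sketch of one PAL macrocell (illustrative names only)."""

    def __init__(self):
        self.q = 0  # state of the D flip-flop

    def clock(self, or_out):
        """Clock edge: the flip-flop samples the OR-gate output."""
        self.q = or_out

    def outputs(self, or_out, select, enable):
        """Return (pin, feedback).

        pin is None when the tri-state buffer is disabled, so the pin
        can be driven from outside and used as an extra input.
        Assumption: feedback is the multiplexer output.
        """
        mux = self.q if select else or_out
        pin = mux if enable else None
        return pin, mux


cell = Macrocell()
print(cell.outputs(or_out=1, select=0, enable=1))  # combinational: (1, 1)
cell.clock(1)
print(cell.outputs(or_out=0, select=1, enable=1))  # registered: (1, 1)
print(cell.outputs(or_out=0, select=1, enable=0))  # pin released: (None, 1)
```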
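Multi-Level Design with PALs
• f = A'BC + A'B'C' + ABC' + AB'C = A'g + Ag'
where g = BC + B'C' and C = h below
[Figure: three macrocells with inputs A and B. The macrocell at pin h has Sel = 0, En = 0, so the pin acts as an input; the macrocell producing g has Sel = 0, En = 1; the third macrocell produces f]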
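The factoring used above can be checked exhaustively. The sketch below compares the two-level expression for f with the multi-level form A'g + Ag'; the function names are chosen here for illustration only.

```python
from itertools import product


def f_two_level(a, b, c):
    """f = A'BC + A'B'C' + ABC' + AB'C."""
    na, nb, nc = 1 - a, 1 - b, 1 - c
    return (na & b & c) | (na & nb & nc) | (a & b & nc) | (a & nb & c)


def g(b, c):
    """g = BC + B'C' (intermediate function fed back to the AND plane)."""
    return (b & c) | ((1 - b) & (1 - c))


def f_multi_level(a, b, c):
    """f = A'g + Ag'."""
    gv = g(b, c)
    return ((1 - a) & gv) | (a & (1 - gv))


mismatches = [
    (a, b, c)
    for a, b, c in product((0, 1), repeat=3)
    if f_two_level(a, b, c) != f_multi_level(a, b, c)
]
print(mismatches)  # an empty list means both forms agree on all 8 inputs
```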
CPLD
– Complex Programmable Logic Devices (CPLD)
– SPLDs (PLA, PAL) are limited in size due to the small number of input and output pins and the limited number of product terms
• Combined number of inputs + outputs < 32 or so
– CPLDs contain multiple circuit blocks on a single chip
• Each block is like a PAL: PAL-like block
• Connections are provided between PAL-like blocks via an interconnection network that is programmable
• Each block is connected to an I/O block as well
Structure of a CPLD
[Figure: PAL-like blocks, each attached to an I/O block, joined by interconnection wires]
Internal Structure of a PAL-like Block
– Includes macrocells
• Usually about 16 each
– Fixed OR planes
• OR gates have fan-in between 5-20
– XOR gates provide negation ability
• XOR has a control input
[Figure: PAL-like block with several macrocells, each ending in a D flip-flop]
Programming a CPLD
CPLDs have many pins – large ones have > 200
Removing a CPLD from a PCB is difficult without breaking the pins
Use ISP (in-system programming) to program the CPLD
A JTAG (Joint Test Action Group) port is used to connect the CPLD to a computer
FPGA Principles
• A Field-Programmable Gate Array (FPGA) is an
integrated circuit that can be configured by the
user to emulate any digital circuit as long as there
are enough resources
• An FPGA can be seen as an array of Configurable
Logic Blocks (CLBs) connected through
programmable interconnect (Switch Boxes)
Copied from Dr. Konstantinos Tatas, http://staff.fit.ac.cy/com.tk
FPGA structure
[Figure: array of Configurable Logic Blocks (CLB) joined by switch boxes (SB) that form the interconnection network; I/O signals (pins) at the edges]
Simplified CLB Structure
[Figure: a CLB contains a Look-Up Table (LUT), a D flip-flop with SET and CLR, and an output MUX. The surrounding array of CLBs, switch boxes (SB) and I/O pins is shown for context]
Programmable Logic lab example
[Figure: lab board with switches SW0 … SW7 (between VDD and GND) connected to FPGA inputs 0 … 7; a mux in the FPGA; channels ch1 … ch4 go to the scope]
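Example of RAM: 4-input AND gate
[Figure: 4-input AND gate with inputs A, B, C, D and output O]

A B C D | O
0 0 0 0 | 0
0 0 0 1 | 0
0 0 1 0 | 0
0 0 1 1 | 0
0 1 0 0 | 0
0 1 0 1 | 0
0 1 1 0 | 0
0 1 1 1 | 0
1 0 0 0 | 0
1 0 0 1 | 0
1 0 1 0 | 0
1 0 1 1 | 0
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 0
1 1 1 1 | 1

[Figure: CLB with LUT inputs A, B, C, D, a D flip-flop (SET, CLR) and an output MUX driving O]
Configuration bits
LUT (addresses 0000 to 1111): 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
MUX: 0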
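The idea that a LUT is a small memory addressed by the logic inputs can be sketched as follows. The helper names `lut_config_bits` and `lut_read` are hypothetical, and the address ordering (first input is the most significant address bit) is an assumption chosen to match the row order of the truth table.

```python
def lut_config_bits(func, n_inputs):
    """Return the truth-table column of func, one bit per LUT address."""
    bits = []
    for addr in range(2 ** n_inputs):
        # Assumption: the first input is the most significant address bit.
        inputs = [(addr >> (n_inputs - 1 - k)) & 1 for k in range(n_inputs)]
        bits.append(func(*inputs))
    return bits


def lut_read(bits, *inputs):
    """Evaluate the LUT: the inputs form the address of the stored bit."""
    addr = 0
    for value in inputs:
        addr = (addr << 1) | value
    return bits[addr]


and4_bits = lut_config_bits(lambda a, b, c, d: a & b & c & d, 4)
print(and4_bits)                        # fifteen 0s followed by a single 1
print(lut_read(and4_bits, 1, 1, 1, 1))  # 1
print(lut_read(and4_bits, 1, 0, 1, 1))  # 0
```

Any other 4-input function only changes the stored bits; the read path stays the same.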
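Example 2: Find the configuration bits for the following circuit
[Figure: 2-to-1 MUX with data inputs A0, A1 and select S, followed by a D flip-flop (SET, CLR, Clock) with output Q]

A0 A1 S
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

[Figure: CLB with LUT inputs A0, A1, S, a D flip-flop (SET, CLR) and an output MUX]
Configuration bits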
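One way to approach this exercise is to model the whole CLB: LUT, flip-flop and output MUX. The sketch below is a hedged illustration, not the slide's official answer. It assumes that S = 0 selects A0, that the LUT address is formed as (A0, A1, S) with A0 most significant, and that an output-MUX bit of 1 selects the registered value; the class `CLB` and function `mux2` are hypothetical names.

```python
class CLB:
    """Simplified CLB: one LUT, one D flip-flop, one output MUX."""

    def __init__(self, lut_bits, use_ff):
        self.lut_bits = list(lut_bits)
        self.use_ff = use_ff  # output-MUX configuration bit (assumed 1 = registered)
        self.q = 0

    def _lut(self, *inputs):
        addr = 0
        for value in inputs:
            addr = (addr << 1) | value
        return self.lut_bits[addr]

    def clock(self, *inputs):
        """Clock edge: the flip-flop stores the current LUT output."""
        self.q = self._lut(*inputs)

    def output(self, *inputs):
        return self.q if self.use_ff else self._lut(*inputs)


def mux2(a0, a1, s):
    """Assumption: S = 0 selects A0, S = 1 selects A1."""
    return a1 if s else a0


mux_bits = [mux2(a0, a1, s) for a0 in (0, 1) for a1 in (0, 1) for s in (0, 1)]
print(mux_bits)  # [0, 0, 0, 1, 1, 0, 1, 1]

clb = CLB(mux_bits, use_ff=1)
clb.clock(1, 0, 0)           # S = 0 passes A0 = 1 to the flip-flop
print(clb.output(0, 0, 0))   # 1: the registered value is held
```

With the opposite select polarity or a different address order the eight LUT bits are permuted, so the result should be read together with the stated assumptions.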
Interconnection Network
[Figure: CLBs joined through switch boxes (SB); each programmable switch in a switch box is set by one configuration bit (0 or 1). The surrounding array of CLBs, interconnection network and I/O pins is shown for context]
Example 3
• Determine the configuration bits for the following circuit implementation in a 2x2 FPGA, with I/O constraints as shown in the following figure. Assume 2-input LUTs in each CLB.
[Figure: 2x2 FPGA with CLB0 to CLB3 and switch boxes SB0 to SB4; pins Input1, Input2, Input3 and Output]
[Figure: target circuit with inputs Input1, Input2, Input3, a D flip-flop (SET, CLR) and Output]
CLBs required
[Figure: the circuit split across two CLBs, each drawn with its LUT, D flip-flop, output MUX and configuration bits; Input1 and Input2 enter one CLB, Input3 enters the other, which drives Output]
Placement: Select CLBs
[Figure: 2x2 FPGA (CLB0 to CLB3, SB0 to SB4) with Input1, Input2, Input3 and Output; the CLBs chosen for the design are marked]
Routing: Select path
[Figure: the chosen path through the switch boxes, with the configuration bits shown for SB1 and SB4]
Configuration Bitstream
• The configuration bitstream must include ALL CLBs and
SBs, even unused ones
• CLB0: 00011
• CLB1: 01100
• CLB2: XXXXX
• CLB3: ?????
• SB0: 000000
• SB1: 000010
• SB2: 000000
• SB3: 000000
• SB4: 000001
The Virtex CLB
Details of One Virtex Slice
Implements any Two 4-input Functions
[Figure: one 4-input function and one registered 3-input function]
Implements any 5-input Function
Implement Some Larger Functions, e.g. 9-input
Two Slices: Any 6-input Function
[Figure: 6-input function built with a signal from the other slice]
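Ripple Carry Adder
[Figure: 4-bit ripple-carry adder. Bit 0 (a0, b0) uses a half adder (HA) producing s0 and c0; bits 1 to 3 use full adders (FA) producing s1, s2, s3 and the carries c1 … c3]

Karnaugh map for s_i
c_{i-1} \ a_i,b_i | 00 01 11 10
0 | 0 1 0 1
1 | 1 0 1 0

Karnaugh map for c_i
c_{i-1} \ a_i,b_i | 00 01 11 10
0 | 0 0 1 0
1 | 0 1 1 1

$a_i + b_i + c_{i-1} = s_i + 2 \cdot c_i$
$s_i = a_i \oplus b_i \oplus c_{i-1}$
$c_i = a_i b_i + a_i c_{i-1} + b_i c_{i-1} = a_i b_i + c_{i-1} (a_i \oplus b_i)$
$c_i = \begin{cases} c_{i-1} & \text{if } a_i \oplus b_i = 1 \\ a_i & \text{if } a_i \oplus b_i = 0 \end{cases}$
$c_i = \begin{cases} c_{i-1} & \text{if } a_i \oplus b_i = 1 \\ 1 & \text{if } a_i b_i = 1 \\ 0 & \text{if } a_i + b_i = 0 \end{cases}$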
Ripple Carry Adders in FPGAs
$c_i = \begin{cases} c_{i-1} & \text{if } a_i \oplus b_i = 1 \\ a_i & \text{if } a_i \oplus b_i = 0 \end{cases}$
$s_i = a_i \oplus b_i \oplus c_{i-1}$
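The multiplexer form of the carry equation maps directly onto a carry chain: the XOR of the operand bits acts as the select signal, and the carry either propagates or is replaced by a_i. The sketch below models this bit by bit; `add_bit` and `ripple_carry_add` are illustrative names, and the carry into bit 0 is assumed to be 0.

```python
def add_bit(a, b, c_in):
    """One adder stage built from the equations on the slide."""
    p = a ^ b                 # a_i XOR b_i, the multiplexer select
    s = p ^ c_in              # s_i = a_i XOR b_i XOR c_{i-1}
    c_out = c_in if p else a  # c_i = c_{i-1} if p = 1, else a_i
    return s, c_out


def ripple_carry_add(x, y, width):
    """Add two unsigned integers of the given width; return (sum, carry_out)."""
    carry = 0  # assumption: no carry into the least significant bit
    total = 0
    for i in range(width):
        s, carry = add_bit((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry


print(ripple_carry_add(9, 5, 4))   # (14, 0)
print(ripple_carry_add(12, 7, 4))  # (3, 1), i.e. 19 = 16 + 3
```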
Fragment of Virtex Configurable Logic Block (CLB)
Lookup Tables used as memory
(16 x 2) Distributed Memory
Lookup Tables used as memory
(32 x 1)
Virtex-5 Logic Architecture
Advanced logic structure
– True 6-input LUTs
– Exclusive 64-bit distributed RAM option per LUT
– Exclusive 32-bit or 16-bit x 2 shift register
[Figure: slice with four LUT6 elements, each usable as RAM64 or SRL32 and each followed by a register/latch]
New Advanced Logic Structure
• Improved slice
– Four LUT6s & FFs per slice
– Better local connection
• True 6-input LUTs
– Higher performance
– Best logic compaction
– Wide logic functions without MUX delays
• 65% higher capacity and one to two speed grades faster than Virtex-4 (4-input LUTs)
[Figure: slice containing four LUT6s]
Logic Compaction with LUT6
Use Fewer LUTs, Faster, Less Routing
[Figure: a 64-bit RAM and an 8-to-1 multiplexer, each shown built from LUT4s and from LUT6s]
New 6-Input LUT with Two Outputs
• True 6-input LUT
– Any function of 6 variables
– No input shared with other LUTs
• Second output adds functionality
– Reduces average slice count by 10%
– 2 independent functions of 5 variables
– 1 function of 6 variables plus 1 subfunction of 5 variables
– 1 function of 3 variables plus 1 function of 2 other variables
– Plus other combinations of subfunctions...
[Figure: 6-input LUT with 2 outputs; inputs A6 … A1, outputs O6 and O5]
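A rough behavioral model helps to see how one 64-bit table can serve two outputs. The sketch assumes the usual dual-output arrangement in which O5 reads only the lower 32 table bits (addressed by A1 … A5) while O6 reads the full 64-bit table; the function names `lut6_2` and `table5` and the packing of the table into one integer are choices made here for illustration.

```python
def lut6_2(init, a):
    """Dual-output LUT sketch.

    init: 64-bit integer holding the table contents.
    a:    tuple (A1, A2, A3, A4, A5, A6).
    Returns (O6, O5).
    """
    a1, a2, a3, a4, a5, a6 = a
    low_addr = a1 | (a2 << 1) | (a3 << 2) | (a4 << 3) | (a5 << 4)
    o5 = (init >> low_addr) & 1                # lower 32 bits only
    o6 = (init >> (low_addr | (a6 << 5))) & 1  # full 64-bit table
    return o6, o5


def table5(func):
    """Pack a 5-input function into a 32-bit table."""
    bits = 0
    for addr in range(32):
        args = [(addr >> k) & 1 for k in range(5)]
        bits |= func(*args) << addr
    return bits


# Two functions of the same five variables: with A6 tied to 1,
# O6 reads the upper half of the table and O5 the lower half.
xor5 = table5(lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e)
and5 = table5(lambda a, b, c, d, e: a & b & c & d & e)
INIT = (xor5 << 32) | and5

print(lut6_2(INIT, (1, 1, 1, 1, 1, 1)))  # (1, 1): XOR of five 1s is 1, AND is 1
print(lut6_2(INIT, (1, 1, 0, 0, 0, 1)))  # (0, 0)
```

With A6 used as a real input instead, O6 becomes a single function of six variables and O5 is the subfunction stored in the lower half.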
Virtex-5 Memory Options…
The Right Memory for the Application
[Figure: memory options arranged by granularity versus capacity]
Distributed RAM/SRL32 (RAM / SRL32 in LOGIC)
• Very granular, localized memory
• Minimal impact on logic routing
• Great for small FIFOs
On-chip BRAM/FIFO
• Efficient, on-chip blocks
• Flexible + optional FIFO logic
• Ideal for mid-sized FIFOs/buffers
Fast Memory Interfaces
• Cost-effective bulk storage
• Memory controller cores
• Large memory requirements
DRAM
• SDRAM
• DDR SDRAM
• FCRAM
• RLDRAM
SRAM
• Sync SRAM
• DDR SRAM
• ZBT
• QDR
FLASH
EEPROM
Distributed RAM
• Distributed LUT memory
– 64-bit blocks throughout the FPGA
– Single-port, dual-port, multi-port
– Can be used as 32-bit shift register
• Very fast (sub-nanosecond)
– Tightly coupled to logic
• Synchronous write, asynchronous read
[Figure: array of slices; some provide logic only, others logic, RAM or shift register]
Distributed memory can be placed anywhere in the FPGA
32-bit Shift Registers in 1 LUT
• Length is dynamically determined by the A inputs
[Figure: 32-bit shift register with inputs D and CLK and end-of-chain output Q31; the address A controls a MUX that selects the tap output Qn]
Convenient way to dynamically change LUT content
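The addressable shift register can be modelled as a list plus a read multiplexer. This is a simplified sketch under the assumption that address n selects the bit shifted in n + 1 clocks ago; the class name `SRL32` and its methods are illustrative and do not reproduce any vendor primitive's port list.

```python
class SRL32:
    """32-bit shift register with a dynamically addressable tap."""

    def __init__(self):
        self.bits = [0] * 32  # bits[0] holds the most recent input

    def clock(self, d):
        """Clock edge: shift in d, drop the oldest bit."""
        self.bits = [d] + self.bits[:-1]

    def q(self, addr):
        """Tap output Qn; the address sets the effective length."""
        return self.bits[addr]

    @property
    def q31(self):
        """End-of-chain output, usable for cascading."""
        return self.bits[31]


srl = SRL32()
for d in [1, 0, 1, 1, 0]:
    srl.clock(d)
print(srl.q(0))  # 0, the last bit shifted in
print(srl.q(4))  # 1, the first bit shifted in, five clocks ago
```

Changing the address between reads changes the delay without touching the stored data.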
BRAM/FIFO Features
• Independent read and write port widths
• Multiple configurations
– True dual-port, simple dual-port, single-port
• Integrated logic for fast and efficient FIFOs
• Synchronous write and read
[Figure: a RAM block shown as dual-port BRAM or as FIFO]
Each RAM block can be configured as BRAM or FIFO
BRAM Mode Top Level View
• True dual port – unrestricted flexibility
– Read and write operations simultaneously and independently on port A and port B
– 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, 1Kx36
• Each port can have different width
[Figure: block RAM with a 36Kb memory array; port A (Addr A, Wdata A, Rdata A) and port B (Addr B, Wdata B, Rdata B), with 36-bit data buses]
In one clock cycle, 4 total operations can be performed
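The listed aspect ratios are different ways of slicing the same array, which a little arithmetic makes visible. In the sketch below the helper names are hypothetical; the observation that the x9, x18 and x36 widths reach 36 Kb because they include one extra (parity) bit per byte is stated as background knowledge rather than taken from the slide.

```python
# (depth in K words, width in bits) as listed on the slide
CONFIGS = [(32, 1), (16, 2), (8, 4), (4, 9), (2, 18), (1, 36)]


def capacity_kbits(depth_k, width):
    """Total storage used by one configuration, in Kb."""
    return depth_k * width


def address_bits(depth_k):
    """Number of address lines needed for depth_k * 1024 words."""
    return (depth_k * 1024).bit_length() - 1


for depth_k, width in CONFIGS:
    print(f"{depth_k}Kx{width}: {capacity_kbits(depth_k, width)} Kb, "
          f"{address_bits(depth_k)} address bits")
```

The narrow configurations use 32 Kb of data, while the byte-oriented ones use the full 36 Kb.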
Virtex IOB
Virtex 7 IOB
Differential / Single Ended Standards
Virtex 7 IOB
Digitally Controlled Impedance (DCI)
IO Standards
I/O Banking Architecture
• Many banks per device:
– Each bank has a separate supply voltage (in order to support different I/O standards)
[Figure: LX330 layout and LX30 layout, showing I/O banks, clock regions, CMTs, CFG banks and global clocks (GClk)]
Edge-Aligned DDR Inputs, Opposite-Edge
[Figure: SelectIO™ input with two flip-flops clocked on opposite edges of CLK; DATA is captured into QA and QB and passed to the FPGA fabric. Waveforms are shown for CLK, DATA, QA, QA', QB and QB']
Need Frequency Conversion
Internal Must Be Lower Than External
[Figure: 1 Gbps DDR data and CLK enter the FPGA; chains of flip-flops (FF) in the FPGA fabric convert the stream into n parallel bits at 1 Gbps / n]
RLOCs or LOCs
Directed Routing Constraints
Timing can be tricky
Skew Affects Setup and Hold Times
[Figure: a source drives clock and data through a connector to the target; waveforms of CLK, DATA1 and DATA2 show setup times tSU1, tSU2 and hold times tH1, tH2]
Channel Timing Can Create Additional Clock Domains
[Figure: channels 1 to 3 each pass through alignment and frequency reduction; signals go from fast unaligned (1, 2, 3) to fast aligned and then slow aligned]
ISERDES Manages Incoming Data
ChipSync™
[Figure: data enters ISERDES and leaves as n bits to the FPGA fabric; CLK is distributed by BUFIO, and BUFR divides it (÷) to produce CLKDIV]
• Frequency division
– Data width to 10 bits
• Dynamic signal alignment
– Bit alignment
– Word alignment
– Clock alignment
– Supports Dynamic Phase Alignment (DPA)
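Frequency division and word alignment can be illustrated with a toy deserializer. This is a deliberately simplified model and not the behavior of the ISERDES primitive: `deserialize` groups a bit list into words, and `find_word_alignment` searches for the offset at which a known training word appears, which stands in for the bitslip mechanism. The width range of 2 to 10 bits and all names are assumptions made for this sketch.

```python
def deserialize(bits, width, bitslip=0):
    """Group serial bits into width-bit words, skipping bitslip leading bits."""
    if not 2 <= width <= 10:
        raise ValueError("this sketch supports widths from 2 to 10 bits")
    usable = bits[bitslip:]
    count = len(usable) // width
    return [usable[k * width:(k + 1) * width] for k in range(count)]


def find_word_alignment(bits, width, training_word):
    """Return the smallest offset at which every word equals training_word."""
    for slip in range(width):
        words = deserialize(bits, width, slip)
        if words and all(word == training_word for word in words):
            return slip
    return None


# A repeating training pattern received with an unknown 3-bit offset
pattern = [1, 1, 0, 0, 1, 0, 1, 0]
stream = [0, 1, 0] + pattern * 3
slip = find_word_alignment(stream, 8, pattern)
print(slip)                               # 3
print(deserialize(stream, 8, slip)[0])    # the training word itself
```

Each output word is produced once per `width` input bits, which is why the fabric-side clock can run at a fraction of the serial bit rate.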
Easy Bit Alignment
ChipSync™
[Figure: DATA passes through IDELAY into ISERDES and on to the FPGA fabric; a state machine in the fabric adjusts the delay through INC/DEC; IDELAY CNTRL is driven by a 190-210 MHz calibration clock]
• 64 delay elements of ~ 70 to 89 ps each
OSERDES Simplifies Frequency Multiplication
ChipSync
[Figure: OSERDES between the FPGA fabric and the pin, with data widths n and m; CLK and CLKDIV are supplied by a DCM/PMCD]
Gigabit Serial Signaling is Everywhere
• Serial is faster than parallel
– Very high multi-gigabit data rates
– Embedded clock avoids clock/data skew
– Reduction in EMI & power consumption
• The preferred choice in many markets
– Telecom, datacom, computing, storage, video/imaging, instrumentation, etc.
– Dominating all new standards activities
[Chart: Percentage of Engineers Designing Serial IO Systems, years 2005 and 2006, data labels 92% and 64%. Source: EE Times Survey, 2005]
Serial transceivers must be flexible, robust and easy to use
The Gigabit Transceiver
GTP Transceiver
• 8 to 96 transceivers per device
• Supporting data rates to 28 Gbps
[Figure: Tx path (PCS, PMA) and Rx path (PMA, PCS) connected to the FPGA fabric interface]
Virtex-5 Delivers Powerful Clock Management
• Combination digital and analog technology
– DCMs (Digital Clock Manager) – based on DLL (Delay-Locked Loop)
– PLLs
• Highest performance
– 550 MHz global clocking
– More than 2x jitter filtering
[Figure: DCM, PLL and clock buffers. Select by: function, component, automatic HDL code]
Virtex-5 Clock Management Tile
• Up to 6 CMTs per device
– Each with 2 DCMs and 1 PLL
• DCM
– 5th generation all-digital technology
– Provides most clocking functions
• PLL
– Reduces internal clock jitter
– Supports higher jitter on reference clocks
– Replaces discrete PLLs and VCOs
Powerful combination of flexibility and precision
Filter Jitter Using the Virtex-5 PLL
PLL Input Clock
>400ps pk-pk jitter
PLL Output Clock
<100ps pk-pk jitter
• 400MHz noisy clock
• Quiet FPGA
Typical Waveform Examples
DCM (Digital Clock Manager) Features
• Operate from 19 MHz – 550 MHz
• Remove clock insertion delay
– “Zero delay clock buffer”
• Correct clock duty cycles
• Synthesize Fout = Fin * M/D
– M, D values up to 32
• Additional DCM_ADV features
– Dynamically phase shift clocks in increments of period/256 or with direct delay line control
– Use Dynamic Reconfiguration Port to adjust parameters without reconfiguring
[Figure: DCM_BASE and DCM_ADV primitives. Inputs: CLKIN, CLKFB, RST; DCM_ADV adds phase shift and DRP ports. Outputs: CLK0, CLK90, CLK180, CLK270, CLK2X, CLK2X180, CLKDV, CLKFX, CLKFX180, LOCKED]
Each DCM can be invoked with either the DCM_BASE or DCM_ADV primitive
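Choosing M and D for the synthesis relation Fout = Fin * M/D is a small search problem. The sketch below simply tries every pair up to 32 and keeps the closest match; `find_m_d` is a hypothetical helper, and the lower bound of 1 for both values is an assumption (a real device may restrict the ranges further and also limits the allowed output frequency).

```python
from fractions import Fraction


def find_m_d(f_in_mhz, f_out_mhz, max_value=32):
    """Return (M, D, achieved_MHz) with M/D closest to f_out/f_in.

    Assumption: M and D may each take any integer value from 1 to max_value.
    Ties are resolved in favor of the smallest D, then the smallest M.
    """
    target = Fraction(f_out_mhz) / Fraction(f_in_mhz)
    best = None
    for d in range(1, max_value + 1):
        for m in range(1, max_value + 1):
            error = abs(Fraction(m, d) - target)
            if best is None or error < best[0]:
                best = (error, m, d)
    _, m, d = best
    return m, d, float(Fraction(f_in_mhz) * m / d)


print(find_m_d(100, 250))  # (5, 2, 250.0)
print(find_m_d(50, 200))   # (4, 1, 200.0)
```

Using exact fractions avoids floating-point ties when several M/D pairs give the same ratio.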
DCM in VHDL
Library UNISIM;
use UNISIM.vcomponents.all;

-- DCM_SP: Digital Clock Manager
-- Spartan-6
-- Xilinx HDL Libraries Guide, version 11.2
DCM_SP_inst : DCM_SP
generic map (
   CLKDV_DIVIDE => 2.0, -- Specifies the extent to which the CLKDLL, CLKDLLE, CLKDLLHF, or
                        -- DCM_SP clock divider (CLKDV output) is to be frequency divided.
   CLKFX_DIVIDE => 1, -- Specifies the frequency divider value for the CLKFX output.
   CLKFX_MULTIPLY => 4, -- Specifies the frequency multiplier value for the CLKFX output.
   CLKIN_DIVIDE_BY_2 => FALSE, -- Enables CLKIN divide by two features.
   CLKIN_PERIOD => "10.0", -- Specifies the input period to the DCM_SP CLKIN input in ns.
   CLKOUT_PHASE_SHIFT => "NONE", -- This attribute specifies the phase shift mode. NONE = No phase
                                 -- shift capability. Any set value has no effect. FIXED = DCM
                                 -- outputs are a fixed phase shift from CLKIN. Value is specified
                                 -- by PHASE_SHIFT attribute. VARIABLE = Allows the DCM outputs to
                                 -- be shifted in a positive and negative range relative to CLKIN.
                                 -- Starting value is specified by PHASE_SHIFT.
   CLK_FEEDBACK => "1X", -- Defines the DCM feedback mode. 1X: CLK0 as feedback 2X: CLK2X
                         -- as feedback.
   DESKEW_ADJUST => "SYSTEM_SYNCHRONOUS", -- Sets configuration bits affecting the clock delay alignment
                                          -- between the DCM_SP output clocks and an FPGA clock input pin.
   DLL_FREQUENCY_MODE => "LOW", -- AUTO mode allows DLL to do automatic frequency search to decide
                                -- whether DLL will operate in LOW or HIGH mode. This is a legacy
                                -- attribute where the high and low value has no effect, it is
                                -- always in auto mode.
   DSS_MODE => "NONE",
   DUTY_CYCLE_CORRECTION => TRUE, -- Corrects the duty cycle of the CLK0, CLK90, CLK180, and CLK270
                                  -- outputs.
   PHASE_SHIFT => 0, -- Defines the amount of fixed phase shift from -255 to 255
   STARTUP_WAIT => FALSE -- Delays configuration DONE until DCM LOCK.
)
port map (
   CLK0 => CLK0, -- 1-bit Same frequency as CLKIN, 0 degree phase shift.
   CLK180 => CLK180, -- 1-bit Same frequency as CLKIN, 180 degree phase shift.
   CLK270 => CLK270, -- 1-bit Same frequency as CLKIN, 270 degree phase shift.
   CLK2X => CLK2X, -- 1-bit Two times CLKIN frequency clock, aligned with CLK0.
   CLK2X180 => CLK2X180, -- 1-bit 180 degree shifted version of the CLK2X clock.
   CLK90 => CLK90, -- 1-bit Same frequency as CLKIN, 90 degree phase shift.
   CLKDV => CLKDV, -- 1-bit Divided version of CLK0. Divide value is programmable.
   CLKFX => CLKFX, -- 1-bit Digital Frequency Synthesizer output (DFS).
   CLKFX180 => CLKFX180, -- 1-bit 180 degree shifted version of the CLKFX clock.
   LOCKED => LOCKED, -- 1-bit Signal indicating when the DCM has locked.
   PSDONE => PSDONE, -- 1-bit Output signal that indicates variable phase shift is done.
   STATUS => STATUS, -- 8-bit DCM Status Bits
   CLKFB => CLKFB, -- 1-bit Feedback clock input to DCM. The feedback input is required unless the DFS
                   -- is used stand-alone. The source of CLKFB must be CLK0 or CLK2X output from the
                   -- DCM.
   CLKIN => CLKIN, -- 1-bit Clock input for the DCM.
   DSSEN => DSSEN,
   PSCLK => PSCLK, -- 1-bit Phase shift clock input. The PSCLK input pin provides the source clock for
                   -- the DCM phase shift.
   PSEN => PSEN, -- 1-bit Variable Phase Shift enable signal, synchronous with PSCLK.
   -- The slide is cut off here; the two remaining DCM_SP inputs are added so the map is complete.
   PSINCDEC => PSINCDEC, -- 1-bit Variable phase shift increment/decrement input.
   RST => RST -- 1-bit Active high reset input.
);
Using the DLL to De-Skew the Clock
Three Types of Clock Resources
[Figure: I/O column with global clocks, regional clocks and I/O clocks; global muxes]
BUFG - Global (Clock) Buffer
This design element is a high-fanout buffer that connects signals to the global routing
resources for low skew distribution of the signal. BUFGs are typically used on clock nets.
Library UNISIM;
use UNISIM.vcomponents.all;
-- BUFG: Global Clock Buffer
-- Virtex-6
-- Xilinx HDL Libraries Guide, version 11.2
BUFG_inst : BUFG
port map (
O => O, -- 1-bit Clock buffer output
I => I -- 1-bit Clock buffer input
);
BUFGCE
This design element is a global clock buffer with a single gated
input. Its O output is "0" when clock enable (CE) is Low
(inactive). When clock enable (CE) is High, the I input is
transferred to the O output.
This module is race condition free.
XtremeDSP in Virtex-5
• Second-generation DSP slice architecture
– 25x18 multiplier
– Per-bit logic functions (AND, OR, XOR, XNOR,…)
• High performance for DSP “heavy lifting”
– 550 MHz operation
Can also be used for fast counters, barrel shifters, etc…
Virtex-5 DSP48E
Full Custom Design Enabling Efficient DSP
Wider internal data-path and 96-bit accumulated output enable higher precision
Pipeline registers enable 550 MHz performance
25x18 input increases precision and efficiency
Pattern detect circuitry increases functionality
[Figure: DSP48E block diagram. Inputs A (25-bit), B (18-bit), C (48-bit) and cascade inputs ACIN, BCIN, PCIN pass through optional pipeline registers and routing logic to the multiplier and the 48-bit adder; outputs P (48-bit, optional 96-bit) and cascade outputs ACOUT, BCOUT, PCOUT; pattern detector (=)]
FPGAs For Massively Parallel DSP
Programmable DSP - Sequential
[Figure: data in and coefficients feed a single MAC unit (multiplier, adder, register) that produces data out; 640 clock cycles needed]
1 GHz / 640 clock cycles = 1.6 MSPS
FPGA - Fully Parallel Implementation
[Figure: data in passes through a chain of registers; each tap is multiplied by its coefficient (C0, C1, C2, C3, …) and the products are summed into data out; 640 operations in 1 clock cycle]
550 MHz / 1 clock cycle = 550 MSPS
640-tap filter implementation is 340 times faster
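The throughput comparison is plain arithmetic and can be reproduced directly. The helper name `throughput_msps` is chosen here for illustration. Note that the unrounded sequential rate is 1.5625 MSPS, so the exact ratio is 352; the figure of about 340 follows from dividing by the rounded 1.6 MSPS.

```python
def throughput_msps(clock_mhz, cycles_per_sample):
    """Output samples per second (in millions) for a filter implementation."""
    return clock_mhz / cycles_per_sample


TAPS = 640

# Sequential DSP processor: one MAC per clock, so 640 clocks per output sample
sequential = throughput_msps(1000, TAPS)

# Fully parallel FPGA filter: all 640 MACs in the same clock cycle
parallel = throughput_msps(550, 1)

print(sequential)               # 1.5625 (shown as 1.6 MSPS on the slide)
print(parallel)                 # 550.0
print(parallel / sequential)    # 352.0
print(parallel / 1.6)           # 343.75, the source of "340 times faster"
```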
Xilinx, 7 Series Families
Zynq = FPGA + Processor