NetFPGA Project 1 4-Port Layer 2/3 Switch


NetFPGA Project:
4-Port Layer 2/3 Switch
Ankur Singla ([email protected])
Gene Juknevicius ([email protected])
Agenda

- NetFPGA Development Board
- Project Introduction
- Design Analysis
  - Bandwidth Analysis
  - Top Level Architecture
  - Data Path Design Overview
  - Control Path Design Overview
- Verification and Synthesis Update
- Conclusion
NetFPGA Development Board

[Board diagram: Ethernet MAC/PHY connects to the CFPGA (with its own SRAM); CFPGA Interface Logic connects the CFPGA to the UFPGA (with its own SRAM).]
Project Introduction

- 4-Port Layer-2/3 Output Queued Switch Design
- Ethernet (Layer-2), IPv4, ICMP, and ARP
- Programmable Routing Tables – Longest Prefix Match, Exact Match
- Register support for Switch Fwd On/Off, Statistics, Queue Status, etc.
- Layer-2 Broadcast and limited Layer-3 Multicast support
- Limited support for Access Control
- Highly Modular Design for future expandability
Bandwidth Analysis

Available Data Bandwidth
- Memory bandwidth: 32 bits * 25 MHz = 800 Mbits/sec
- CFPGA to Ingress FIFO/Control Block bandwidth: 32 bits * 25 MHz / 4 = 200 Mbits/sec
- Packet Queue to Egress bandwidth: 32 bits * 25 MHz / 4 = 200 Mbits/sec

Packet Processing Requirements
- 4 ports operating at 10 Mbits/sec => 40 Mbits/sec aggregate
- Minimum-size packet of 64 bytes => 512 bits
- 512 bits / 40 Mbits/sec = 12.8 us per packet
- Internal clock is 25 MHz
- 12.8 us * 25 MHz = 320 clocks to process one packet
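The clock-budget arithmetic above can be checked with a short script (the port count, port rate, clock frequency, and minimum frame size are all taken from the slide):

```python
# Reproduce the slide's worst-case packet-processing budget.
PORTS = 4
PORT_RATE_BPS = 10e6     # 10 Mbit/s per Ethernet port
CLOCK_HZ = 25e6          # internal 25 MHz clock
MIN_PKT_BITS = 64 * 8    # 64-byte minimum Ethernet frame = 512 bits

aggregate_bps = PORTS * PORT_RATE_BPS       # 40 Mbit/s total ingress
pkt_time_s = MIN_PKT_BITS / aggregate_bps   # 12.8 us per minimum-size packet
clocks_per_pkt = pkt_time_s * CLOCK_HZ      # clock cycles available per packet

print(aggregate_bps, pkt_time_s * 1e6, clocks_per_pkt)
# 320 clocks available to process each minimum-size packet
```

Because minimum-size packets arriving back to back on all four ports are the worst case, any per-packet latency under 320 cycles keeps the switch at line rate.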
Top Level Architecture

[Block diagram: CFPGA Interface Logic (to the CFPGA chip) connects to the Ingress FIFO Controller, Control Block, and Switching and Routing Engine; each block exchanges Data and Control with the Main Arbiter, which drives the Memory Controller (to SRAM).]
Data Flow Diagram

- Output Queued Shared Memory Switch
- Round Robin Scheduling
- Packet Processing Engine provides L2/L3 functionality
- Coarse Pipelined Architecture at the Block Level

[Data flow: Data Ingress from CFPGA -> Ingress FIFO Controller -> Control Block / Forwarding Engine -> Memory Controller -> Data Egress to CFPGA]
Master Arbiter

- Round Robin Scheduling of service to each input and output
- Interfaces the rest of the design with the Control FPGA
- Co-ordinates activities of all high-level blocks
- Maintains Queue Status for each output

[State diagram: out of Reset the arbiter sits in an Idle State; round-robin algorithms cycle through Ports 0-3, scheduling packet moves: Ingress -> Ingress FIFO, Ingress FIFO -> Queue Memory, Queue Memory -> Egress, Ingress -> Control Block, and Control Block -> Egress.]
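The arbitration policy itself is simple enough to model in software. A minimal sketch of a round-robin grant over four ports (the function name and request representation are assumptions, not the RTL):

```python
def round_robin(requests, last_granted, num_ports=4):
    """Grant the first requesting port after the one granted last time.

    requests: set of port numbers currently requesting service.
    Returns the granted port number, or None if nothing is requesting.
    """
    for offset in range(1, num_ports + 1):
        port = (last_granted + offset) % num_ports
        if port in requests:
            return port
    return None

# Ports 1 and 3 request; port 1 was served last, so port 3 wins.
print(round_robin({1, 3}, last_granted=1))  # 3
# Same requests, but port 3 was served last, so port 1 wins.
print(round_robin({1, 3}, last_granted=3))  # 1
```

Rotating the search starting point after each grant is what guarantees every input and output port a fair share of the shared memory bandwidth.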
Ingress FIFO Control Block

- Interfaces three blocks: the Control FPGA, the Forwarding Engine, and the Packet Buffer Controller
- Dual Packet Memories (Bank 0 / Bank 1) for coarse pipelining
- Responsible for Packet Replication for Broadcast

[State machine: IDLE -> Move Packet from CFPGA (on gnt_ci_sw) -> Forwarding (on packetMoveDone / eop_ci_ufpga_o) -> Move Packet to SRAM (on packetProcessingDone & grant_sw_sram). Each packet memory bank (Pkt 0 / Pkt 1) also records the packet's Src Port and Length.]
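The dual packet memories behave like a ping-pong buffer: while one bank receives a packet from the CFPGA, the other is processed and drained toward SRAM. A minimal software model of that idea (the class and field names are hypothetical; the real design is RTL with two physical memories):

```python
class PingPongBuffer:
    """Two packet banks: one fills while the other drains (coarse pipelining)."""

    def __init__(self):
        self.banks = [None, None]   # each entry holds (src_port, packet_bytes)
        self.fill = 0               # bank currently receiving from the CFPGA

    def receive(self, src_port, packet):
        """Store an arriving packet in the current fill bank, then swap roles."""
        self.banks[self.fill] = (src_port, packet)
        self.fill ^= 1

    def drain(self):
        """Hand the previously filled bank's packet onward (e.g. to SRAM)."""
        bank = self.fill ^ 1
        pkt, self.banks[bank] = self.banks[bank], None
        return pkt

buf = PingPongBuffer()
buf.receive(0, b"pkt-A")   # bank 0 fills from port 0
print(buf.drain())          # (0, b'pkt-A') drains while bank 1 is free to fill
```

The payoff is overlap: packet N can be moved to SRAM while packet N+1 is still streaming in from the CFPGA.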
Packet Processing Engine Overview

Goals
- Features – L3/L2/ICMP/ARP Processing
- Performance Requirement – 78 Kpps
- Fit within 60% of a Single User FPGA Block
- Modularity / Scalability
- Verification / Design Ease

Actual
- Support for all required features + L2 broadcast, L3 multicast, LPM, Statistics, and Policing (coarse access control)
- Performance Achieved – 234 Kpps (worst case 69 Kpps for 1500-byte ICMP echo requests)
- Requires only 12% of Single UFPGA resources
- Highly Modular Design for design/verification/scalability ease
Pkt Processing Engine Block Diagram

[Block diagram: packets from the CFPGA go through First Level Parsing into Packet Memory 0/1; the Forwarding Master State Machine dispatches them to the L3 Processing, ICMP Processing, ARP Processing, L2 Processing, and Statistics and Policing blocks (native packets pass straight through), then on to the Packet Buffer.]
Forwarding Master State Machine

- Responsible for controlling the individual processing blocks
- Request/Grant scheme for future expandability
- Initiates a request for a packet to the Ingress FIFO, then assigns it to the responsible agent based on packet contents
- Replication of the MSM can provide more throughput

[State diagram: IDLE (while !gntForProcessing || softReset) -> Initiate First Level Parsing (on gntForProcessing) -> on parsingDone, Processing Selection: Type == ipv4 -> L3 Processing; Type == ARP -> ARP Processing; Protocol Type == ICMP -> ICMP Processing; defaultL2Fwd -> L2 Forwarding; each processing state loops while !Completion and may exit to Policing/Drop; completed packets pass through Statistics Update back to IDLE.]
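The Processing Selection step amounts to a dispatch on parsed header fields. A software sketch of that decision (the function and parameter names are hypothetical; the ICMP-to-router condition comes from the L3 Processing Engine slide below):

```python
def select_agent(ether_type, ip_protocol=None, dest_is_router=False):
    """Assign a parsed packet to exactly one processing agent."""
    if ether_type == "ipv4":
        # ICMP aimed at one of the router's own addresses is handled
        # by the ICMP engine; all other IPv4 goes to L3 processing.
        if ip_protocol == "icmp" and dest_is_router:
            return "ICMP Processing"
        return "L3 Processing"
    if ether_type == "arp":
        return "ARP Processing"
    return "L2 Forwarding"   # defaultL2Fwd fallback for everything else

print(select_agent("ipv4", "tcp"))                        # L3 Processing
print(select_agent("ipv4", "icmp", dest_is_router=True))  # ICMP Processing
print(select_agent("arp"))                                # ARP Processing
print(select_agent("unknown"))                            # L2 Forwarding
```

Because each agent signals completion back through a request/done handshake, new packet types can be added by registering another branch without reworking the master machine.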
L3 Processing Engine

- Parsing of the L3 information: Src/Dest Addr, Protocol Type, Checksum, Length, TTL
- Longest Prefix Match Engine
  - Mask bits represent the prefix; the lookup key is the Dest Addr
  - Associated Info Table (AIT) indexed using the entry hit
  - AIT provides Destination Port Map, Destination L2 Addr, and Statistics Bucket Index
  - Request/Done scheme to allow for expandability (e.g. a future m-way trie implementation project)
- ICMP Support Engine request (if the Dest Addr is the router's IP address and the Protocol Type is ICMP)
- Total of 85 cycles for packet processing, with 80% of the cycles spent on the table lookup
- Using a 4-way trie, total processing time could be reduced to less than 30 cycles
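The mask-based lookup can be sketched as a scan over (value, mask) entries that keeps the hit with the most mask bits set; the winning entry's index then selects the AIT row. The table contents and layout below are illustrative assumptions:

```python
# Each route entry: (prefix value, mask, AIT index). The lookup key
# is the destination IPv4 address; the mask bits encode the prefix.
ROUTES = [
    (0x0A000000, 0xFF000000, 0),   # 10.0.0.0/8
    (0x0A010000, 0xFFFF0000, 1),   # 10.1.0.0/16
    (0x00000000, 0x00000000, 2),   # default route (matches everything)
]

def lpm_lookup(dest_addr):
    """Return the AIT index of the longest matching prefix."""
    best, best_len = None, -1
    for value, mask, ait in ROUTES:
        if dest_addr & mask == value:
            plen = bin(mask).count("1")   # prefix length from the mask
            if plen > best_len:
                best, best_len = ait, plen
    return best

print(lpm_lookup(0x0A010203))  # 10.1.2.3 -> AIT entry 1 (/16 beats /8)
print(lpm_lookup(0x0B000001))  # 11.0.0.1 -> AIT entry 2 (default route)
```

A linear scan like this mirrors why the table lookup dominates the 85-cycle budget; a 4-way trie walks at most 16 levels of a 32-bit key instead of touching every entry.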
L2 Processing Engine

- If there are any processing problems with ARP, ICMP, and/or L3, then L2 switching is done
- Exact Match Engine
  - Re-uses the LPM match engine, but with the Mask Bits set to all 1's
  - Associated Info Table (AIT) indexed using the entry hit
  - AIT provides Destination Port Map and Statistics Bucket Index
  - Request/Done scheme to allow for expandability (e.g. a future hash implementation project)
- Learning Engine removed because of Switch/Router hardware verification problems (HP switch bug)
- Total of 76 cycles for packet processing, with over 80% of the cycles spent on the table lookup
- Using a hashing function, total processing time could be reduced to less than 20 cycles
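Re-using the LPM engine for exact match amounts to programming every mask with all 1's, so an entry only hits when the whole 48-bit MAC address matches. A minimal sketch of that reuse (the table contents are hypothetical):

```python
FULL_MASK = (1 << 48) - 1   # all 1's: the entry must match every bit

# L2 table in the same (value, mask, AIT index) format as the LPM table.
L2_TABLE = [
    (0x001122334455, FULL_MASK, 0),
    (0x66778899AABB, FULL_MASK, 1),
]

def exact_match(dest_mac):
    """Exact match is LPM with all-ones masks: an entry hits only on full equality."""
    for value, mask, ait in L2_TABLE:
        if dest_mac & mask == value:
            return ait
    return None   # miss -> fall back to default forwarding / flooding

print(exact_match(0x66778899AABB))  # 1
print(exact_match(0x000000000001))  # None (unknown destination MAC)
```

Sharing one match engine between L2 and L3 is what keeps the design small (12% of the UFPGA); a hash table would replace the scan with near-constant-time lookup, hence the sub-20-cycle estimate.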
Packet Buffer Interface

- Interfaces with the Master Arbiter and the Forwarding Engine
- Output Queued Switch
  - Statically assigned single queue per port
  - Off-chip ZBT SRAM on the NetFPGA board

[State machine: out of Reset, an Idle State alternates with Packet Write into Queue Memory and Packet Read from Queue Memory.]

Packet Queue Memory Organization
- 256K x 36-bit SRAM device
- 4 static queues
- 128 packets per queue
- 2 KBytes per packet
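With 4 static queues of 128 packets at 2 KB each, a packet's base address in the shared SRAM is a simple concatenation of queue number and packet slot. A sketch of the address arithmetic, assuming 4 data bytes are stored per 36-bit SRAM word (the extra bits typically carry parity or control):

```python
WORDS_PER_PACKET = 2048 // 4                         # 2 KB slot, 4 data bytes/word
PACKETS_PER_QUEUE = 128
QUEUE_WORDS = PACKETS_PER_QUEUE * WORDS_PER_PACKET   # 65536 words per queue

def packet_base_addr(queue, slot):
    """Word address of packet `slot` (0-127) in static queue `queue` (0-3)."""
    assert 0 <= queue < 4 and 0 <= slot < PACKETS_PER_QUEUE
    return queue * QUEUE_WORDS + slot * WORDS_PER_PACKET

print(packet_base_addr(0, 0))    # 0
print(packet_base_addr(3, 127))  # 261632 -- last slot of the last queue

# The four static queues exactly fill the 256K-word SRAM device.
assert 4 * QUEUE_WORDS == 256 * 1024
```

Static partitioning wastes space when traffic is skewed toward one port, but it makes the address decode trivial and removes any free-list management from the data path.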
Control Block

- Typical register read/write functionality
  - Status Register
  - Control Register (forwarding disable, reset)
  - Router's IP Addresses (ports 1-4)
  - Queue Size Registers
  - Statistics Registers
  - Layer-2 Table Programming Registers
  - Layer-3 Table Programming Registers

[State machine: Reset -> Idle State -> Packet Reception -> Packet Parsing -> Packet Processing -> Register Read/Write -> Packet Transmission.]
Verification

- Three levels of verification performed
- Simulations:
  - Module Level – to verify the module design intent and the bus functional model
  - System Level – using the NetFPGA verification environment for packet-level simulations
- Hardware Verification
  - Ported the System Level tests to create tcpdump files for the NetFPGA traffic server
  - Very good success on hardware, with all System Level tests passing
  - Only one modification required (reset generation) after the hardware port
  - Demo – Greg can provide lab access to anyone interested
Synthesis Overview

- Design was ported to the Altera EP20K400 device
- Logic Elements utilized – 5833 (35% of total LEs)
- RAM ESBs used – 46848 (21% of total ESBs)
- Max design clock frequency ~ 31 MHz
- No timing violations

Design Block Name              Flip-flops (Actual)   RAM bits (Actual)   Gates (Actual)
Main Arbiter                   71                    0                   1500
Memory Controller              109                   0                   2000
Control Block                  608                   0                   5000
Ingress FIFO Controller        60                    64000               1200
Switching and Routing Engine   925                   14000               14000
Total                          1773                  78000               23700
Conclusion

- Easy to achieve the "required" performance in an OQ Shared Memory Switch on NetFPGA
- Modularity of the design allows more interesting and challenging future projects
- The Design/Verification Environment was essential to meet the schedule
- NetFPGA is an excellent design exploration platform