Our ‘recv1000.c’ driver
Implementing a ‘packet-receive’ capability with the Intel 82573L network interface controller
Similarities
• There are quite a few similarities between implementing the ‘transmit-capability’ and the ‘receive-capability’ in a device-driver for Intel’s 82573L ethernet controller:
– Identical device-discovery and ioremap steps
– Same steps for ‘global reset’ of the hardware
– Comparable data-structure initializations
– Parallel setups for the TX and RX registers
• But there also are a few fundamental differences (such as ‘active’ versus ‘passive’ roles for the driver)
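‘push’ versus ‘pull’
[Diagram: in host memory, the transmit packet buffer is connected by a ‘push’ arrow to the ethernet controller’s transmit-FIFO, and the receive packet buffer is connected by a ‘pull’ arrow to the controller’s receive-FIFO; both FIFOs connect to/from the LAN]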
The ‘write()’ routine in our ‘xmit1000.c’ driver could transfer data at any time, but the ‘read()’ routine in our ‘recv1000.c’ driver has to wait for data to arrive.
So to avoid doing any wasteful busy waiting, our ‘recv1000.c’ driver can use the Linux kernel’s sleep/wakeup mechanism – if it enables NIC’s interrupts!
Sleep/wakeup
• We will need to employ a wait-queue, we will need to enable device-interrupts, and we will need to write and install the code for an interrupt service routine (ISR)
• So our ‘recv1000.c’ driver will have a few additional code and data components that were absent in our ‘xmit1000.c’ driver
Driver’s components
• wait_queue_head
• my_isr(): This function will awaken any sleeping reader-task
• my_fops: This ‘struct’ holds one function-pointer (read)
• my_read(): This function will program the actual data-transfer
• my_get_info(): This function will allow us to inspect the receive-descriptors
• module_init(): This function will detect and configure the hardware, define page-mappings, allocate and initialize the descriptors, install our ISR and enable interrupts, start the ‘receive’ engine, create the pseudo file and register ‘my_fops’
• module_exit(): This function will do needed ‘cleanup’ when it’s time to unload our driver: turn off the ‘receive’ engine, disable interrupts and remove our ISR, free memory, delete page-table entries, the pseudo file, and the ‘my_fops’
How NIC’s interrupts work
• There are four interrupt-related registers which are essential for us to understand:
0x00C0 ICR Interrupt Cause Read
0x00C8 ICS Interrupt Cause Set
0x00D0 IMS Interrupt Mask Set/Read
0x00D8 IMC Interrupt Mask Clear
Interrupt event-types
Bit-assignments for the 82573L (bits not listed here are reserved or not described on this slide):
31: INT_ASSERTED (1=yes, 0=no)
17: ACK (Rx-ACK Frame detected)
16: SRPD (Small Rx-Packet detected)
15: TXD_LOW (Tx-Descr Low Thresh hit)
9: MDAC (MDI/O Access Completed)
7: RXT0 (Receiver Timer expired)
6: RXO (Receiver Overrun)
4: RXDMT0 (Rx-Desc Min Thresh hit)
2: LSC (Link Status Change)
1: TXQE (Transmit Queue Empty)
0: TXDW (Transmit Descriptor Written Back)
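The bit-assignments above are what a driver would consult when it ‘logs’ interrupt causes. As a rough user-space sketch (not code from ‘recv1000.c’; the table name ‘icr_bits’ and the function ‘icr_decode()’ are made up for illustration), a value read from ICR could be turned into a list of event names like this:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bit positions copied from the event-type list on the slide. */
struct icr_bit { int bit; const char *name; };

const struct icr_bit icr_bits[] = {
    { 31, "INT_ASSERTED" }, { 17, "ACK" },  { 16, "SRPD" },   { 15, "TXD_LOW" },
    {  9, "MDAC" },         {  7, "RXT0" }, {  6, "RXO" },    {  4, "RXDMT0" },
    {  2, "LSC" },          {  1, "TXQE" }, {  0, "TXDW" },
};

/* Hypothetical helper: writes the space-separated names of all set,
   known bits into 'out' and returns how many names were written. */
int icr_decode( uint32_t cause, char *out, size_t outlen )
{
    int     n = 0;
    size_t  used = 0;

    assert( out != NULL );
    if ( outlen == 0 ) return 0;
    out[0] = '\0';
    for ( size_t i = 0; i < sizeof icr_bits / sizeof icr_bits[0]; i++ ) {
        if ( !( cause & ( UINT32_C(1) << icr_bits[i].bit ) ) ) continue;
        int w = snprintf( out + used, outlen - used, "%s%s",
                          n ? " " : "", icr_bits[i].name );
        if ( w < 0 || (size_t)w >= outlen - used ) {
            out[ used ] = '\0';     /* no room: drop the partial name */
            break;
        }
        used += (size_t)w;
        n++;
    }
    return n;
}
```

In a kernel module the resulting string would presumably be handed to ‘printk()’ rather than returned to a caller.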
Interrupt Mask Set/Read
• This register is used to enable a selection of the device’s interrupts which the driver will be prepared to recognize and handle
• A particular interrupt becomes ‘enabled’ if software writes a ‘1’ to the corresponding bit of this Interrupt Mask Set register
• Writing ‘0’ to any register-bit has no effect, so interrupts can be enabled one-at-a-time
Interrupt Mask Clear
• Your driver can discover which interrupts have been enabled by reading IMS, but your driver cannot ‘disable’ any interrupts by writing to that register
• Instead a specific interrupt can be disabled by writing a ‘1’ to the corresponding bit in the Interrupt Mask Clear register
• Writing ‘0’ to a register-bit has no effect on the controller’s Interrupt Mask
Interrupt Cause Read
• Whenever interrupts occur, your driver’s interrupt service routine can discover the specific conditions that triggered them if it reads the Interrupt Cause Read register
• In this case your driver can clear any selection of these bits (except bit #31) by writing ‘1’s to them (writing ‘0’s to this register will have no effect)
• In case no interrupt has occurred, reading this register may have the side-effect of clearing it
Interrupt Cause Set
• For testing your driver’s interrupt-handler, you can artificially trigger any particular combination of interrupts by writing ‘1’s into the corresponding register-bits of this Interrupt Cause Set register (assuming your combination of bits corresponds to interrupts that are ‘enabled’ by ‘1’s being present for them in the Interrupt Mask)
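The write-‘1’ conventions of these four registers can be tried out in a small user-space model. This is only a sketch under simplifying assumptions: ‘struct nic_model’ and every function name are invented here, cause bits are recorded whether or not they are enabled (only bit #31 depends on the mask), and the read side-effects of ICR mentioned on the slide are deliberately not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define INT_ASSERTED  UINT32_C(0x80000000)   /* bit #31 of ICR */

/* Hypothetical software model of the NIC's interrupt state. */
struct nic_model {
    uint32_t  mask;     /* Interrupt Mask: 1 = interrupt enabled    */
    uint32_t  cause;    /* pending interrupt causes, bits 30..0     */
};

/* IMS: writing '1's enables interrupts, writing '0's changes nothing. */
void write_ims( struct nic_model *n, uint32_t v ) { n->mask |= v; }

/* IMS can also be read back to see what is enabled. */
uint32_t read_ims( const struct nic_model *n ) { return n->mask; }

/* IMC: writing '1's disables interrupts, writing '0's changes nothing. */
void write_imc( struct nic_model *n, uint32_t v ) { n->mask &= ~v; }

/* ICS: writing '1's raises the corresponding causes (used for testing). */
void write_ics( struct nic_model *n, uint32_t v ) { n->cause |= ( v & ~INT_ASSERTED ); }

/* ICR write: '1's clear the corresponding causes, '0's change nothing. */
void write_icr( struct nic_model *n, uint32_t v ) { n->cause &= ~v; }

/* ICR read: pending causes, plus bit #31 if any of them is enabled. */
uint32_t read_icr( const struct nic_model *n )
{
    uint32_t  v = n->cause;
    if ( n->cause & n->mask ) v |= INT_ASSERTED;
    return v;
}
```

For example, after ‘write_ims( &n, 0x80 )’ and ‘write_ics( &n, 0x80 )’ the model’s ‘read_icr( &n )’ reports RXT0 together with INT_ASSERTED, and writing that value back with ‘write_icr()’ clears it, which mirrors the read-then-acknowledge pattern used by an interrupt-handler.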
Our interrupt-handler
• We decided to enable all possible causes (and we ‘log’ them via ‘printk()’ messages we’ve omitted in the code-fragment here):

    irqreturn_t my_isr( int irq, void *dev_id )
    {
        // find out which events triggered this interrupt
        int intr_cause = ioread32( io + E1000_ICR );
        if ( intr_cause == 0 ) return IRQ_NONE;   // not our device
        // awaken any sleeping reader-task
        wake_up_interruptible( &wq_rd );
        // clear the causes we have seen by writing ‘1’s back to ICR
        iowrite32( intr_cause, io + E1000_ICR );
        return IRQ_HANDLED;
    }
We ‘tweak’ our packet-format
• Our ‘xmit1000.c’ driver elected to have the NIC append ‘padding’ to any short packets
• But this prevents a receiver from knowing how many bytes represent actual data
• To solve this problem, we added our own ‘count’ field to each packet’s payload

Packet layout (byte-offsets):
0: destination MAC-address
6: source MAC-address
12: Type/Len
14: count
16: actual bytes of user-data
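To make the layout concrete, here is a hedged user-space sketch of building and inspecting such a packet. The function names, the ‘type_len’ parameter, and the software padding to 60 bytes (the minimum Ethernet frame length without its CRC) are assumptions made for illustration; in the real drivers the NIC does the padding, and the ‘count’ is stored in host byte-order just as the receiving code reads it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte-offsets from the packet layout on the slide. */
enum { OFF_DST = 0, OFF_SRC = 6, OFF_TYPE = 12, OFF_COUNT = 14, OFF_DATA = 16 };
enum { MIN_FRAME = 60 };    /* assumed: minimum frame length, CRC excluded */

/* Hypothetical helper: builds a packet in 'pkt', which must have room for
   max( MIN_FRAME, 16 + len ) bytes.  Returns the length after padding. */
size_t build_packet( unsigned char *pkt, const unsigned char dst[6],
                     const unsigned char src[6], uint16_t type_len,
                     const void *data, uint16_t len )
{
    size_t  total = (size_t)OFF_DATA + len;

    assert( pkt != NULL && data != NULL );
    memcpy( pkt + OFF_DST,   dst, 6 );
    memcpy( pkt + OFF_SRC,   src, 6 );
    memcpy( pkt + OFF_TYPE,  &type_len, sizeof type_len );
    memcpy( pkt + OFF_COUNT, &len, sizeof len );    /* host byte-order */
    memcpy( pkt + OFF_DATA,  data, len );
    if ( total < (size_t)MIN_FRAME ) {              /* stand-in for NIC padding */
        memset( pkt + total, 0, (size_t)MIN_FRAME - total );
        total = MIN_FRAME;
    }
    return total;
}

/* Hypothetical helper: recovers the number of actual data-bytes. */
uint16_t packet_count( const unsigned char *pkt )
{
    uint16_t  count;
    memcpy( &count, pkt + OFF_COUNT, sizeof count );
    return count;
}
```

A five-byte message ends up in a 60-byte packet, yet ‘packet_count()’ still reports 5, which is the whole point of the added field.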
Our ‘read()’ method
    ssize_t my_read( struct file *file, char *buf, size_t len, loff_t *pos )
    {
        static int rxhead = 0;   // to remember where we left off
        unsigned char *from = phys_to_virt( rxdesc[ rxhead ].base_addr );
        unsigned int count;

        // go to sleep if no new data-packets have been received yet
        if ( ioread32( io + E1000_RDH ) == rxhead )
            if ( wait_event_interruptible( wq_rd, ioread32( io + E1000_RDH ) != rxhead ) )
                return -EINTR;

        // get the number of actual data-bytes in the new (possibly padded) data-packet
        count = *(unsigned short*)(from + 14);   // data count as stored by ‘xmit1000.c’
        if ( count > len ) count = len;   // can’t transfer more bytes than buffer can hold
        if ( copy_to_user( buf, from+16, count ) ) return -EFAULT;

        // advance our static array-index variable to the next receive-descriptor
        rxhead = (1 + rxhead) % 8;   // this index wraps-around after 8 descriptors
        return count;   // tell kernel how many bytes were transferred
    }
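The index logic of this ‘read()’ method can be exercised outside the kernel. The following is a user-space model only: ‘struct rx_model’ and ‘model_read()’ are hypothetical names, the ‘rdh’ field stands in for the NIC’s RDH register, a return value of -1 marks the spot where the real driver would sleep, and the clamp against the buffer size is an extra guard that the slide’s code does not have.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { N_DESC = 8, BUF_SIZE = 1536 };

/* Hypothetical stand-in for the receive ring and its two indices. */
struct rx_model {
    unsigned char  buf[ N_DESC ][ BUF_SIZE ];
    unsigned       rdh;       /* plays the role of the RDH register  */
    unsigned       rxhead;    /* the driver's own 'static' index     */
};

/* Returns the number of bytes copied, or -1 where the driver would sleep. */
long model_read( struct rx_model *m, unsigned char *dst, size_t len )
{
    assert( m != NULL && dst != NULL );
    if ( m->rdh == m->rxhead ) return -1;       /* no new packet yet */

    const unsigned char  *from = m->buf[ m->rxhead ];
    uint16_t  stored;
    memcpy( &stored, from + 14, sizeof stored );    /* the 'count' field */

    size_t  count = stored;
    if ( count > BUF_SIZE - 16 ) count = BUF_SIZE - 16;  /* extra guard */
    if ( count > len ) count = len;             /* excess bytes are lost */
    memcpy( dst, from + 16, count );

    m->rxhead = ( 1 + m->rxhead ) % N_DESC;     /* wraps after 8 descriptors */
    return (long)count;
}
```

Calling ‘model_read()’ with a ‘len’ smaller than the stored count shows the truncation behavior directly: the index still advances, so the remaining bytes of that packet can no longer be retrieved.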
Hardware’s initialization
• We allocate and initialize a minimum-size Receive Descriptor Queue (8 descriptors)
• We perform a ‘global reset’ via the RST-bit in the NIC’s Device Control register (with a side-effect of zeroing both RDH and RDT)
• We configure the ‘receive’ engine (RCTL) plus a few additional registers that affect the network controller’s reception-options (namely: RXCSUM, RFCTL, PSRCTL)
Receive Control (0x0100)
Bit-fields, from bit 31 down to bit 0 (R=0 marks a reserved bit):
31: R=0
30-27: FLXBUF = Flexible Buffer size
26: SECRC = Strip Ethernet CRC
25: BSEX = Buffer Size Extension
24: R=0
23: PMCF = Pass MAC Control Frames
22: DPF = Discard Pause Frames
21: R=0
20: CFI = Canonical Form Indicator bit-value
19: CFIEN = Canonical Form Indicator Enable
18: VFE = VLAN Filter Enable
17-16: BSIZE = Receive Buffer Size
15: BAM = Broadcast Accept Mode
14: R=0
13-12: MO = Multicast Offset
11-10: DTYP = Descriptor Type
9-8: RDMTS = Rx-Descriptor Minimum Threshold Size
7-6: LBM = Loopback Mode
5: LPE = Long Packet reception Enable
4: MPE = Multicast Promiscuous Enable
3: UPE = Unicast Promiscuous Enable
2: SBP = Store Bad Packets
1: EN = Receive Enable
0: R=0
Our driver initially will program this register with the value 0x0400801C. Then later, when everything is ready, it will turn on bit #1 (EN) to ‘start the receive engine’.
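The value 0x0400801C can be checked by composing it from individual bits. In this sketch the macro and function names are our own (they are not taken from ‘recv1000.c’); the set bits are SECRC (26), BAM (15), MPE (4), UPE (3) and SBP (2), which is simply what the hexadecimal value decodes to.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the Receive Control bits used here. */
#define RCTL_EN     ( UINT32_C(1) <<  1 )   /* Receive Enable               */
#define RCTL_SBP    ( UINT32_C(1) <<  2 )   /* Store Bad Packets            */
#define RCTL_UPE    ( UINT32_C(1) <<  3 )   /* Unicast Promiscuous Enable   */
#define RCTL_MPE    ( UINT32_C(1) <<  4 )   /* Multicast Promiscuous Enable */
#define RCTL_BAM    ( UINT32_C(1) << 15 )   /* Broadcast Accept Mode        */
#define RCTL_SECRC  ( UINT32_C(1) << 26 )   /* Strip Ethernet CRC           */

/* The value programmed while the receive engine is still turned off. */
uint32_t rctl_initial( void )
{
    return RCTL_SBP | RCTL_UPE | RCTL_MPE | RCTL_BAM | RCTL_SECRC;
}

/* The same value with bit #1 turned on to start the receive engine. */
uint32_t rctl_started( void )
{
    return rctl_initial() | RCTL_EN;
}
```

All multi-bit fields (BSIZE, DTYP, RDMTS, LBM, MO) are left at zero by this value, so the driver appears to rely on their default meanings, including the ‘legacy’ descriptor type.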
Packet-Split Rx Control (0x2170)
Bit-fields (all other bits are 0):
29-24: BSIZE3 (in KB)
21-16: BSIZE2 (in KB)
13-8: BSIZE1 (in KB)
6-0: BSIZE0 (in 1/8 KB)
If the controller is configured to use the packet-split feature (RCTL.DTYP=1), then this register controls the sizes of the four receive-buffers, so there are certain requirements that nonzero values appear in several of these fields.
But our ‘recv1000.c’ driver will use the ‘legacy’ receive-descriptor format (i.e., RCTL.DTYP=0), so this register will be disregarded by the NIC and we are allowed to program it with the value 0x00000000.
Receive Filter Control (0x5008)
Bit-fields in bits 15 to 0, as far as they can be recovered from the slide’s diagram:
15: EXSTEN
14: IPFRSP_DIS
13: ACKD_DIS
12: ACKDIS
11: IPv6 XSUM_DIS
10: IPv6_DIS
9-8: NFS_VER
7: NFSR_DIS
6: NFSW_DIS
0: iSCSI_DIS
Our driver writes 0x00000000 to this register, which among other effects will cause the ethernet controller NOT to write Extended Status information into our device driver’s legacy-format Receive Descriptors (bit 15: EXSTEN=0)
RX Checksum Control (0x5000)
Bit-fields:
31-10: reserved
9: TCP/UDP Checksum Off-load enabled (1=yes, 0=no)
8: IP Checksum Off-load enabled (1=yes, 0=no)
7-0: packet checksum start (this field controls the starting byte for the Packet Checksum calculation)
Our driver programs this register with the value 0x00000000, which disables Checksum Off-loading for TCP/UDP packets (which we won’t be receiving) and for IP packets (which likewise won’t be sent by our ‘xmit1000.c’ driver); all Packet-Checksums will be calculated starting from the very first byte
Rx-Descriptor Control (0x2828)
Bit-fields (all other bits are 0):
24: GRAN (Granularity)
21-16: WTHRESH (Writeback Threshold)
13-8: HTHRESH (Host Threshold)
5-0: PTHRESH (Prefetch Threshold)
“This register controls the fetching and write back of receive descriptors. The three threshold values are used to determine when descriptors are read from, and written to, host memory. Their values can be in units of cache lines or of descriptors (each descriptor is 16 bytes), based on the value of the GRAN bit (0=cache lines, 1=descriptors). When GRAN = 1, all descriptors are written back (even if not requested).” --Intel manual
Recommended for 82573: 0x01010000 (GRAN=1, WTHRESH=1)
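The recommended value can be reproduced from its two fields. This is a small sketch with invented names (‘rxdctl_value()’ and the two macros); it assumes WTHRESH is the 6-bit field starting at bit 16 and GRAN is bit 24, which is what 0x01010000 decodes to.

```c
#include <assert.h>
#include <stdint.h>

#define RXDCTL_WTHRESH_SHIFT  16                      /* WTHRESH begins at bit 16 */
#define RXDCTL_GRAN           ( UINT32_C(1) << 24 )   /* 1 = units of descriptors */

/* Hypothetical helper: composes an Rx-Descriptor Control value from a
   writeback threshold and the granularity choice. */
uint32_t rxdctl_value( unsigned wthresh, int gran_descriptors )
{
    uint32_t  v = (uint32_t)( wthresh & 0x3F ) << RXDCTL_WTHRESH_SHIFT;

    if ( gran_descriptors ) v |= RXDCTL_GRAN;
    return v;
}
```

With these assumptions, ‘rxdctl_value( 1, 1 )’ yields 0x01010000, i.e. a writeback threshold of one descriptor.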
Maximum-size buffers
• We use a minimal number of maximum-size receive-buffers (eight of 1536 bytes each)
[Diagram: eight buffers (buffer 0 through buffer 7) in kernel memory, each referenced by an entry in a ring of eight rx-descriptors; the NIC “owns” our rx-descriptors; the ring is located via RDBAH/RDBAL, with RDLEN = 0x80; the ring positions are labeled 0 through 8]
rxhead: Our ‘static’ variable
RDH: This register gets initialized to 0, then gets changed by the controller as new packets are received
RDT: This register gets initialized to 8, then never gets changed
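The figure RDLEN = 0x80 follows from the descriptor size: eight descriptors of 16 bytes each occupy 128 bytes. The sketch below spells that out with a legacy-format descriptor layout; only the ‘base_addr’ field name appears in the driver code shown earlier, so the other field names and both helper functions are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Legacy-format receive-descriptor: 16 bytes in all. */
struct rx_descriptor {
    uint64_t  base_addr;      /* physical address of the packet-buffer   */
    uint16_t  pkt_length;     /* filled in by the NIC on reception       */
    uint16_t  pkt_chksum;     /* packet checksum                         */
    uint8_t   desc_status;    /* status bits                             */
    uint8_t   desc_errors;    /* error bits                              */
    uint16_t  vlan_tag;       /* the 'special' field                     */
};

_Static_assert( sizeof( struct rx_descriptor ) == 16,
                "a legacy rx-descriptor must occupy 16 bytes" );

enum { N_RX_DESC = 8, RX_BUF_SIZE = 1536 };

/* Value for RDLEN: the ring's length in bytes. */
uint32_t rdlen_value( void )
{
    return (uint32_t)( N_RX_DESC * sizeof( struct rx_descriptor ) );
}

/* Total kernel memory needed for the eight packet-buffers. */
uint32_t buffer_bytes( void )
{
    return (uint32_t)( N_RX_DESC * RX_BUF_SIZE );
}
```

So the ring itself needs only 0x80 bytes, while the eight buffers together need 12288 bytes of kernel memory.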
Driver ‘defects’
• If an application tries to ‘read’ from our device file ‘/dev/nic’, but the controller received a packet that contains more bytes of data than the user requested, the excess bytes get “lost” (i.e., discarded)
• If an application delays reading packets while the controller continues receiving, then an earlier packet gets “overwritten”
In-class exercise #1
• Discuss with your nearest class-member your ideas for how these driver ‘defects’ might be overcome, so that packet-data being received will be protected against getting “lost” and/or being “overwritten”
In-class exercise #2
• Login to a pair of machines on the ‘anchor’ cluster and install our ‘xmit1000.ko’ and our ‘recv1000.ko’ modules (one on each)
• Try transferring a textfile from one of the machines to the other, by using ‘cat’:
anchor01$ cat textfile > /dev/nic
anchor02$ cat /dev/nic > recv1000.out
• How large a textfile can you successfully transfer using our simple driver-modules?