
Realizing the Performance Potential of the
Virtual Interface Architecture
Evan Speight, Hazim Abdel-Shafi, and John K.
Bennett
Rice University, Dept. of Electrical and Computer
Engineering
Presented by Constantin Serban, R.U.
VIA Goals
• Communication infrastructure for System
Area Networks (SANs)
• Targets mainly high-speed cluster
applications
• Efficiently harnesses the communication
performance of underlying networks
Trends
• Peak bandwidth has increased by two orders of
magnitude over the past decade, while user-level
latency has decreased only modestly.
• The latency introduced by the protocol is
typically several times the latency of the
transport layer.
• The problem is especially acute for
small messages
Targets
The VI architecture addresses the following issues:
• Decrease latency, especially for small
messages (used in synchronization)
• Increase the aggregate bandwidth (only a
fraction of the peak bandwidth is otherwise utilized)
• Reduce the CPU processing spent on
message overhead
Overhead
Overhead mainly comes from two sources:
• Every network access requires one or two
traps into the kernel
– the user/kernel mode switch is time consuming
• Usually two data copies occur:
– from the user buffer to the message-passing
API
– from the message layer to the kernel buffer
VIA approach
• Remove the kernel from the critical path
– Moving communication code out of the kernel
into user space
• Provide a zero-copy protocol
– Data is sent/received directly from/into the user
buffer; no intermediate message copy is performed
VIA emerged as a standardization effort from
Compaq, Intel, and Microsoft
It was built on several academic ideas:
• The main architecture is most similar to U-Net
• Essential features derived from VMMC
Among current implementations:
– GigaNet cLan – VIA implemented in hardware
– Tandem ServerNet – VIA emulated in a software
driver
– Myricom Myrinet – VIA emulated in
firmware
VIA architecture
VIA operations
• Set-Up/Tear-Down: VIA is a point-to-point,
connection-oriented protocol; the VI endpoint is
the core concept in VIA
• Register/De-Register Memory
• Connect/Disconnect
• Transmit
• Receive
• RDMA
VIA operations
Set-Up/Tear-Down: VIA is a point-to-point,
connection-oriented protocol
• The VI endpoint is the core concept in VIA
• The VipCreateVi function creates a VI endpoint in
user space.
• The user-level library passes the call to the kernel
agent, which passes the creation information to the
NIC.
• The OS thus controls the application's access to the NIC
VIA operations - cont’d
Register/De-Register Memory:
• All data buffers and descriptors reside in
registered memory
• The NIC performs DMA I/O operations on this
registered memory
• Registration pins the pages into physical
memory and provides a handle used to manipulate the
pages and pass their addresses to the NIC
• Registration is performed once, usually at the
beginning of the communication session
VIA operations - cont’d
Connect/Disconnect:
• Before communication, each endpoint is
connected to a remote endpoint
• The connection is passed to the kernel agent
and down to the NIC
• VIA does not define an addressing scheme;
existing schemes can be used in various
implementations
VIA operations - cont’d
Transmit/receive:
• The sender builds a descriptor for the message to
be sent. The descriptor points to the actual data
buffer. Both the descriptor and the data buffer reside
in a registered memory area.
• The application then posts a doorbell to signal the
availability of the descriptor. The doorbell contains
the address of the descriptor.
• Doorbells are maintained in an internal queue
inside the NIC
VIA operations - cont’d
Transmit/receive (cont’d):
• Meanwhile, the receiver creates a descriptor that
points to an empty data buffer and posts a doorbell
in the receiver NIC queue
• When the doorbell in the sender queue reaches
the head of the queue, the data is sent onto the
network through a double indirection
(doorbell -> descriptor -> buffer).
• The first doorbell/descriptor is picked up from the
receiver queue and its buffer is filled with the
data
VIA operations - cont’d
RDMA:
• As a mechanism derived from VMMC, VIA
allows Remote DMA operations:
RDMA Read and Write
• Each node allocates a receive buffer and registers
it with the NIC. Additional structures containing
read and write pointers into the receive buffers are
exchanged during connection setup
• Each node can then read from and write to the remote
node's buffer directly.
• These operations pose potential implementation
problems.
Evaluation Benchmarks
• Two VI implementations :
– GigaNet cLan: bandwidth 125 MB/s, latency 480 ns
– Tandem ServerNet: bandwidth 50 MB/s, latency 300 ns
• Performance measured:
– Bandwidth and Latency
– Polling vs. Blocking
– CPU Utilization
Bandwidth
Latency
Latency Polling/Blocking
CPU utilization
MPI performance using VIA
• The challenge is to deliver this performance to
distributed applications
• Software layers such as MPI are typically used
between VIA and the application: they provide
increased usability but bring additional
overhead
• How can this layer be optimized to work
efficiently with VIA?
MPI VIA - performance
MPI observations
• The difference between MPI-UDP and
MPI-VIA-baseline is remarkable
• MPI-VIA-baseline is still dramatically far from
VIA-Native
• Several improvements are proposed to bring
MPI-VIA closer to native VIA: reduce
MPI overhead
MPI Improvements
• Eliminating unnecessary copies:
MPI over UDP and over VIA uses a single set of receive
buffers, so data must be copied to the application; instead,
allow the user to register any buffer
• Choosing a synchronization primitive:
all synchronization formerly used OS constructs/events;
a better implementation uses processor swap instructions
• No acknowledgment:
remove the per-message acknowledgment by switching to
a reliable VIA mode
VIA - Disadvantages
• Polling vs. blocking synchronization – a tradeoff
between CPU consumption and overhead
• Memory registration: locking large amounts of
memory makes virtual memory mechanisms
inefficient; registering/deregistering on the fly is
slow
• Point-to-point vs. multicast: VIA lacks multicast
primitives; implementing multicast over the existing
mechanisms makes communication inefficient
Conclusion
• Small latency for small messages; small
messages have a strong impact on
application behavior
• Significant improvement over UDP
communication (does this still hold after recent
TCP/UDP hardware offload implementations?)
• At the expense of a less convenient API