RDMA's Long Hard Road

As network transport speeds increase, new software architecture is required to remove data transmission bottlenecks.

Jul 1, 2005 - By Art Wittmann

Promise: RDMA, along with TCP offload, will reduce latency, improve server throughput, and substantially reduce CPU load associated with processing network traffic.

Players: Building on work done for InfiniBand, the technology is being championed by Microsoft, Intel, IBM, HP, and a bunch of start-ups. Standards and implementations are being drawn up by the IETF, the Open Group, and the Linux community.

Prospects: RDMA's most obvious use is for speeding SAN and NAS access in the data center. Using RDMA requires that both the sender and the receiver be equipped with RDMA NICs, so the technology won't see significant use outside the data center for some time.


It's a funny thing about bottlenecks--removing one usually reveals another. We've already seen it as data transports move over 1Gbps, and the problem will only get worse when 10Gbps transports become commonplace. At these high transfer speeds, server performance in general, and memory and PCI bus bandwidth in particular, can be swamped by data from the network. A number of solutions have been developed to address the problem, but so far adoption has been slow.

Remote Direct Memory Access (RDMA) is the capstone of these technologies. Together with techniques such as TCP offload, it solves an important problem--namely, getting data into and out of an application's memory space as efficiently as possible. RDMA even has the right backers, including Microsoft, HP, Intel, and IBM, not to mention a bunch of start-ups. So why has it been so long in coming, and what can it do for you?

BOTTLENECKS

There are three distinct problems that slow down data transfers entering or leaving the host, and they all involve processing overhead. The first is the context transition from the application to the kernel and back each time new data arrives on a network port. The second is the protocol processing. Off-loading protocol processing from the CPU can save a good bit of CPU resources, as does avoiding the context transition typically required to let the kernel oversee the receipt of new data. Both problems are addressed by TCP Offload Engines (TOEs), NICs that have the smarts on board to handle all or most of the processing of TCP/IP traffic.
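To make the first two overheads concrete, here's a minimal sketch of a conventional, non-offloaded receive loop in C (the function and buffer names are illustrative, not taken from any particular product): every read() is a system call--a context transition into the kernel--and each one implies a copy from kernel socket buffers into the application's buffer.

/* Minimal sketch of a conventional (non-offloaded) receive path.
 * Every read() below is a system call: a context transition into the
 * kernel, which then copies data from its socket buffers into buf.
 * TOEs and RDMA aim to remove exactly this per-transfer kernel work. */
#include <stdio.h>
#include <unistd.h>

#define BUF_SIZE 65536

/* connected_fd is assumed to be an already-connected TCP socket. */
ssize_t drain_socket(int connected_fd)
{
    char buf[BUF_SIZE];
    ssize_t n, total = 0;

    /* One context transition per read(), plus one kernel-to-user copy. */
    while ((n = read(connected_fd, buf, sizeof buf)) > 0)
        total += n;            /* the application would process buf here */

    if (n < 0)
        perror("read");
    return total;
}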

That leaves the third overhead condition encountered by most high-speed transports today, and that's the resources required to copy data buffers from NIC to kernel memory and then from kernel memory to application memory. For small data transfers, memory-to-memory copies aren't an issue, but as data transfers become large, they can account for the majority of the overhead involved in transferring the data. Each data transfer must cross the memory bus twice--once to read, and once to write. Memory transfer speeds aren't much faster than current network transfer speeds, so a couple of trips across the bus adds significantly to the total transfer time. The problem is particularly acute for servers with 10Gbps transports.
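For a rough sense of the numbers, here is a back-of-the-envelope calculation. The 10Gbps line rate comes from the discussion above; the memory bandwidth figure is an assumed, era-appropriate value used only for illustration.

/* Back-of-the-envelope arithmetic for the copy overhead described above.
 * The 10Gbps line rate is from the article; the memory bandwidth figure
 * is an assumption chosen for illustration. */
#include <stdio.h>

int main(void)
{
    double line_rate_gbps   = 10.0;               /* network transport   */
    double line_rate_gbytes = line_rate_gbps / 8; /* = 1.25 GB/s         */
    double mem_bw_gbytes    = 3.2;                /* assumed memory bus  */

    /* Each buffer copy crosses the memory bus twice: one read, one write. */
    double bus_load_per_copy = 2 * line_rate_gbytes;      /* 2.5 GB/s */
    double fraction_of_bus   = bus_load_per_copy / mem_bw_gbytes;

    printf("One extra copy at line rate consumes %.1f GB/s,\n"
           "about %.0f%% of the assumed memory bandwidth.\n",
           bus_load_per_copy, fraction_of_bus * 100);
    return 0;
}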

RNIC IS YOUR NIC

RDMA seeks to eliminate the overhead of large data transfers. An RDMA NIC (RNIC) can move received data directly into an application's memory space. The OS is bypassed completely, or is invoked only to set up the exchange and then notified when the task is finished. Chipmaker Broadcom estimates that RNICs using its RDMA chips can reduce processor overhead by 30 to 60 percent for large transfers.
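The article doesn't tie RDMA to any particular programming interface, but a hedged sketch in the style of the OpenFabrics verbs API (libibverbs) shows the basic idea: the application registers a buffer, then posts an RDMA write that the RNIC carries out directly against the peer's memory, with no kernel copy on the data path. The function assumes a queue pair, completion queue, protection domain, and remote address/rkey that were established during connection setup; none of those details come from the article.

/* Hedged sketch of an RDMA write using OpenFabrics-style verbs
 * (libibverbs), shown only to illustrate the idea described above.
 * Assumes an already-connected queue pair (qp), its completion queue
 * (cq), a protection domain (pd), and a remote address/rkey obtained
 * during connection setup. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                       struct ibv_cq *cq,
                       uint64_t remote_addr, uint32_t rkey)
{
    static char payload[4096];
    memset(payload, 'x', sizeof payload);

    /* Register the application buffer so the RNIC may DMA it directly. */
    struct ibv_mr *mr = ibv_reg_mr(pd, payload, sizeof payload,
                                   IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)payload,
        .length = sizeof payload,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr = {
        .sg_list             = &sge,
        .num_sge             = 1,
        .opcode              = IBV_WR_RDMA_WRITE, /* place data directly */
        .send_flags          = IBV_SEND_SIGNALED, /* in the peer's memory */
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };

    struct ibv_send_wr *bad = NULL;
    if (ibv_post_send(qp, &wr, &bad))  /* the kernel is not involved */
        goto err;                      /* in moving the payload      */

    /* Wait for the RNIC to report completion. */
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        goto err;

    ibv_dereg_mr(mr);
    return 0;
err:
    ibv_dereg_mr(mr);
    return -1;
}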

The RDMA Consortium has developed specifications so that the technology can be implemented without modifying applications. However, RDMA isn't wire-compatible with the protocols it supersedes, so both ends of the connection must be RDMA-aware for it to work. RDMA implementations have been available for a while now: both Microsoft and Linux offer the Consortium's Sockets Direct Protocol (SDP), which lets applications written to standard TCP sockets use RDMA-capable sockets without modification.
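To see why no application changes are needed, consider an ordinary TCP sockets client like the sketch below. Under SDP, code like this can be mapped onto RDMA-capable sockets beneath the sockets API; the mapping is done by the SDP layer rather than by the application, and the exact mechanism varies by platform and release. The address and port here are placeholders for illustration.

/* Ordinary TCP sockets client, unchanged for SDP. The point of SDP is
 * that code like this does not need to be rewritten: the SDP layer can
 * map standard stream sockets onto RDMA-capable sockets underneath. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* plain TCP socket */
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port   = htons(5001) };
    inet_pton(AF_INET, "192.0.2.10", &peer.sin_addr);  /* example address */

    if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
        perror("connect");
        return 1;
    }

    const char msg[] = "the application code is the same over SDP or TCP\n";
    write(fd, msg, sizeof msg - 1);
    close(fd);
    return 0;
}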

At the Transport layer, RDMA relies on protocols originally developed for InfiniBand that have since been standardized for TCP/IP under the name iWARP. The RDMA Consortium has handed that work over to the IETF, which is in the process of ratifying the corresponding RFCs.

TROUBLE IN PARADISE

But while the Transport-layer protocols are almost solidified (solid enough for vendors to create products and test interoperability), there's still work to do within the OS. The Linux community is working on an OpenRDMA specification that will expose RDMA services to applications, but that specification hasn't yet been accepted by the keepers of the Linux kernel.

There are other challenges facing RDMA on Linux as well. Independent hardware vendors are reluctant to release the source code for their TCP offload and RDMA technology as open source, as the Linux community normally requires. The result is a negotiation in which the two sides agree to open-source some interface code, but typically not the code that runs on the card itself.

Microsoft had a setback of its own this spring when RNIC maker Alacritech sued the company claiming infringement of its patents on dynamic TCP offload. In April, a U.S. District Court granted a preliminary injunction against Microsoft, which effectively barred the company from putting technology it calls Chimney into its OS. Microsoft's Chimney architecture allows for some protocol functions to be off-loaded to dedicated NIC hardware, while others are maintained in the OS.

WHAT'S THE USE?

That essentially leaves two primary applications for RDMA in the short run. The first is application-specific offload, such as that proposed for iSCSI. The RDMA Consortium has developed the iSCSI Extensions for RDMA (iSER) protocol, which allows for the separation of data and control commands. Control is still handled at the kernel level, but data movement is done using RDMA. This is a better and more flexible approach than using iSCSI-offload NICs, which attempt to take on all the iSCSI processing tasks. The protocol will be a boon in the data center, but is unlikely to be widely used elsewhere for some time to come.

The other primary use is in High Performance Computing (HPC) clusters. Compared to the storage market, this is a tiny but quickly growing data center application. Proprietary hardware and InfiniBand have been the typical transports of choice in these environments, but iWARP and the faster incarnations of Ethernet are now broadening HPC choices. In this environment, applications are written to be aware of the cluster and will directly interact with kernel bypass protocols. Here, the Interconnect Software Consortium sets the standards under the umbrella of the Open Group.
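The article doesn't name a programming interface here either, but HPC cluster codes are typically written to a message-passing library such as MPI, whose implementations can map sends and receives onto RDMA-capable interconnects (InfiniBand or iWARP) underneath. The minimal sketch below assumes an MPI installation and at least two ranks (for example, mpirun -np 2); the application expresses only message passing, while the library decides how the bytes actually move.

/* Hedged illustration: HPC applications commonly use a message-passing
 * library such as MPI; the library, not the application, decides whether
 * a message travels over TCP, InfiniBand, or iWARP with RDMA.
 * Run with at least two ranks. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double value = 0.0;
    if (rank == 0 && size > 1) {
        value = 42.0;
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %.1f\n", value);
    }

    MPI_Finalize();
    return 0;
}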

Developing HPC applications is a rarefied science, and chances are that if you don't currently have an HPC application, you don't need or want one. That said, the move to industry-standard hardware could widen the appeal of financial modeling and scientific applications. The other obvious adopters of this sort of technology are database vendors, though native Linux and Windows support will be critical to adoption in that case.

