Network protocol From Wikipedia, the free encyclopedia
RDMA over Converged Ethernet (RoCE)[1] is a network protocol which allows remote direct memory access (RDMA) over an Ethernet network. There are multiple RoCE versions. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed. Although the RoCE protocol benefits from the characteristics of a converged Ethernet network, the protocol can also be used on a traditional or non-converged Ethernet network.[2][3][4][5]
Network-intensive applications such as networked storage or cluster computing need a network infrastructure with high bandwidth and low latency. The advantages of RDMA over other network application programming interfaces such as Berkeley sockets are lower latency, lower CPU load and higher bandwidth.[6] The RoCE protocol allows lower latencies than its predecessor, the iWARP protocol.[7] There are RoCE host channel adapters (HCAs) with a latency as low as 1.3 microseconds,[8][9] while the lowest known iWARP HCA latency in 2011 was 3 microseconds.[10]
The RoCE v1 protocol is an Ethernet link layer protocol with Ethertype 0x8915.[2] This means that the frame length limits of the Ethernet protocol apply: 1500 bytes of payload for a regular Ethernet frame and 9000 bytes for a jumbo frame.
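As an illustration of the Ethertype-based identification described above, the following minimal sketch packs an Ethernet II header and checks whether a frame carries the RoCE v1 Ethertype. The function names and the MAC addresses are hypothetical, and the check assumes an untagged frame (no 802.1Q VLAN tag); it is not taken from any RoCE implementation.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # Ethertype stated in the text for RoCE v1


def build_ethernet_header(dst_mac: bytes, src_mac: bytes, ethertype: int) -> bytes:
    """Pack a 14-byte Ethernet II header: destination MAC, source MAC, Ethertype."""
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be exactly 6 bytes")
    # "!" selects network (big-endian) byte order, as used on the wire.
    return struct.pack("!6s6sH", dst_mac, src_mac, ethertype)


def is_roce_v1_frame(frame: bytes) -> bool:
    """Return True if an untagged Ethernet II frame carries the RoCE v1 Ethertype.

    Assumption: no VLAN tag is present, so the Ethertype sits at byte offset 12.
    """
    if len(frame) < 14:
        return False
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    return ethertype == ROCE_V1_ETHERTYPE


if __name__ == "__main__":
    # Hypothetical locally administered MAC addresses, for illustration only.
    header = build_ethernet_header(b"\x02" * 6, b"\x06" * 6, ROCE_V1_ETHERTYPE)
    print(is_roce_v1_frame(header + bytes(32)))
```

Because the Ethertype is the only discriminator here, a switch or capture filter can separate RoCE v1 traffic from IP traffic without inspecting anything above the link layer, which is also why RoCE v1 cannot cross a router.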
RoCE v1.5 is an uncommon, experimental, non-standardized protocol that is based on IP. RoCE v1.5 uses the IP protocol field to differentiate its traffic from other IP protocols such as TCP and UDP. The value used for the protocol number is unspecified and is left to the deployment to select.
The RoCE v2 protocol runs on top of either UDP/IPv4 or UDP/IPv6.[3] UDP destination port 4791 has been reserved for RoCE v2.[11] Because RoCE v2 packets are routable, the RoCE v2 protocol is sometimes called Routable RoCE[12] or RRoCE.[4] Although the delivery order of UDP packets is not guaranteed in general, the RoCE v2 specification requires that packets with the same UDP source port and the same destination address must not be reordered.[4] In addition, RoCE v2 defines a congestion control mechanism that uses the IP ECN bits for marking and CNP[13] frames for the acknowledgment notification.[14] Software support for RoCE v2 includes Mellanox OFED 2.3 or later and Linux kernel v4.5.[15]
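The encapsulation described above can be sketched as a small classifier that recognises an IPv4/UDP packet addressed to port 4791 and reads the two ECN bits from the IP header. This is only an illustrative sketch under stated assumptions: the function names and addresses are hypothetical, only IPv4 is handled, checksums are left at zero, and the RoCE v2 payload itself is not parsed. The ECN codepoint `0b11` (Congestion Experienced) comes from the general IP ECN definition, not from the RoCE v2 specification.

```python
import struct

ROCE_V2_UDP_PORT = 4791  # UDP destination port reserved for RoCE v2 (per the text)
IPPROTO_UDP = 17         # IP protocol number for UDP
ECN_CE = 0b11            # "Congestion Experienced" ECN codepoint


def build_ipv4_udp(src_port: int, dst_port: int, ecn: int = 0, payload: bytes = b"") -> bytes:
    """Build a minimal IPv4+UDP packet for experiments (checksums left at zero)."""
    udp_len = 8 + len(payload)
    ip_header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45,              # version 4, header length 5 * 4 = 20 bytes
        ecn & 0x03,        # DSCP = 0, ECN in the two low-order bits
        20 + udp_len,      # total length
        0, 0,              # identification, flags/fragment offset
        64,                # TTL
        IPPROTO_UDP,
        0,                 # header checksum (not computed in this sketch)
        bytes([10, 0, 0, 1]),  # hypothetical source address
        bytes([10, 0, 0, 2]),  # hypothetical destination address
    )
    udp_header = struct.pack("!HHHH", src_port, dst_port, udp_len, 0)
    return ip_header + udp_header + payload


def parse_roce_v2(packet: bytes):
    """Return (udp_source_port, ecn_bits) for an IPv4/UDP packet to port 4791, else None."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        return None
    ihl = (packet[0] & 0x0F) * 4  # IPv4 header length in bytes
    if ihl < 20 or len(packet) < ihl + 8 or packet[9] != IPPROTO_UDP:
        return None
    src_port, dst_port = struct.unpack_from("!HH", packet, ihl)
    if dst_port != ROCE_V2_UDP_PORT:
        return None
    return src_port, packet[1] & 0x03


if __name__ == "__main__":
    marked = build_ipv4_udp(49152, ROCE_V2_UDP_PORT, ecn=ECN_CE)
    print(parse_roce_v2(marked))
```

The returned UDP source port, together with the destination address, forms the key under which the text says packets must not be reordered, so a load balancer in this sketch's setting would hash on that pair rather than spraying packets individually.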
RoCE defines how to perform RDMA over Ethernet while the InfiniBand architecture specification defines how to perform RDMA over an InfiniBand network. RoCE was expected to bring InfiniBand applications, which are predominantly based on clusters, onto a common Ethernet converged fabric.[16] Others expected that InfiniBand will keep offering a higher bandwidth and lower latency than what is possible over Ethernet.[17]
The technical differences between the RoCE and InfiniBand protocols are:
While the RoCE protocols define how to perform RDMA using Ethernet and UDP/IP frames, the iWARP protocol defines how to perform RDMA over a connection-oriented transport like the Transmission Control Protocol (TCP). RoCE v1 is limited to a single Ethernet broadcast domain. RoCE v2 and iWARP packets are routable. The memory requirements of a large number of connections, along with TCP's flow and reliability controls, lead to scalability and performance issues when using iWARP in large-scale datacenters and for large-scale applications (e.g., large-scale enterprises, cloud computing and web 2.0 applications[21]). Also, multicast is defined in the RoCE specification while the current iWARP specification does not define how to perform multicast RDMA.[22][23][24]
Reliability in iWARP is given by the protocol itself, as TCP is reliable. RoCE v2, on the other hand, uses UDP, which has a far smaller overhead and better performance but does not provide inherent reliability, so reliability must be implemented alongside RoCE v2. One solution is to use converged Ethernet switches to make the local area network reliable. This requires converged Ethernet support on all the switches in the local area network and prevents RoCE v2 packets from traveling through a wide area network such as the internet, which is not reliable. Another solution is to add reliability to the RoCE protocol (i.e., reliable RoCE), which adds handshaking to RoCE to provide reliability at the cost of performance.
The question of which protocol is better depends on the vendor. Chelsio recommends and exclusively supports iWARP. Mellanox, Xilinx, and Broadcom recommend and exclusively support RoCE/RoCE v2. Intel initially supported iWARP but now supports both iWARP and RoCE v2.[25] Other vendors involved in the network industry, such as Marvell, Microsoft, Linux and Kazan, provide support for both protocols.[26] Cisco supports both RoCE[27] and its own VIC RDMA protocol.
Both protocols are standardized: iWARP is the standard for RDMA over TCP defined by the IETF, and RoCE is the standard for RDMA over Ethernet defined by the IBTA.[26]
Some aspects that could have been defined in the RoCE specification have been left out. These are:
In addition, any protocol running over IP cannot assume the underlying network has guaranteed ordering, any more than it can assume congestion cannot occur.
It is known that the use of PFC can lead to a network-wide deadlock.[32][33][34]
Some vendors of RoCE-enabled equipment include: