Sunday, 6 April 2008

A Fistful of Protocols

With my able assistants, clever Google and slightly less reliable Memory (with a "y"), I considered existing protocols designed for high-speed, highly-reliable networks, including URP (indirectly), NETBLT, LNTP, VMTP, SNR, and XTP, roughly in that order; and read previous reviews of them. I also looked at Plan 9's early Nonet and later IL/IP protocols, RFC1077, SCP, SCMP, and a few "one-sided communication" protocols. (I haven't got a specification for Nonet but most of the code is online.) The MPI specification defines a programming interface not a protocol, and so is not directly useful but there are papers about using one-sided protocols for MPI implementation.

Most of these protocols were designed to include high-speed wide-area internets, except Nonet and IBM's one-sided protocol with no name, which I shall call Clint. Clint was designed specifically for the Blue Gene environment. As noted earlier, that environment does not eliminate failures such as loss or data corruption, but greatly reduces their likelihood. For instance, the network is local area only, will not impose arbitrary delays on packets, and will not drop packets because some intervening node is suddenly congested (as might happen on the Internet). It might make a mistake, but rarely. That should allow us to make a few simplifying assumptions, but our project's requirements for fault-tolerance still require us to account for uncommon errors.

The Clint paper shows that per-packet header overhead matters, because the payload is small: 248 bytes on the Torus, after allowing for an 8-byte header read by the hardware. The hardware likes data on 16-byte boundaries, so an extra 8-byte sofware header fits well. That is too small for a complete message context, so Clint transmits a larger header containing a full context in every initial message until the first response from the receiver provides its own unique identifier for that context, reducing the header to the near minimal 8 bytes in all subsequent messages. The scheme saves the latency of a round trip to establish that context first, and avoids wasting bandwidth on big headers in all packets.

What do we need from our transport protocol? We certainly need virtual circuits for 9P connections, and reliable messaging for MPI-like things.
Of the existing protocols, XTP looked the most promising: it offers fast setup of contexts for message sending, including context key exchange (with a similar effect to Clint). It follows URP in making the receiver reactive, which allows it to support both datagrams and connections in the same structure by having a transmitter drive the receiver appropriately. The receiver state is transmitted only on demand, and promptly, removing the need for most timers at the receiver.

I am currently implementing a simplified variant of XTP, with a nod or two to Nonet. The packet formats are different, with a view towards a more subtle implementation along x-kernel lines. Some aspects of XTP are replaced by other mechanisms. I think rate-based control is better done at the application level, for instance, and we might do something a little different for multicast, given the special forms of multicast supported by the Tree and Torus. Timers can be rather sloppy and statically-defined, because errors will typically arise only because a node crashes or hangs, or someone is careless with the wire cutters. Out-of-order delivery will occur on the Torus and must not cause needless retransmissions, but how many packet frames might elapse before an obviously missing packet is declared lost? Apart from that, the reliability of the networks allows an early driver to assume error-free networks, with later support for retransmission and other error recovery.


No comments: