Blue Gene has groups of up to 64 CPU nodes, each with several processors, connected to the outside world through an IO node per group. The network provision is unusual (although that itself is not unusual in the supercomputer world). Only the IO node has a conventional Ethernet. The CPU nodes are typically connected in a 2D or 3D structure by a special Torus network. CPU nodes within each group are connected to each other and to the IO node for the group by a "class routed" network, commonly called the Tree, although routing tables can create other topologies.
The Ethernet is nothing special and we simply use the existing Internet protocols. Just to get started, we also run IP over the Tree and the Torus, with small MTUs. Given Plan 9's straightforward Medium structure in its Internet stack, it was short work to add Tree and Torus medium drivers. A kernel process reads the raw device (/dev/torus or /dev/vc0), strips the medium header, and passes the resulting Block up the stack; in the other direction, the medium driver adds the tree or torus driver's header to an IP packet and writes it to the device. A few lines of shell script configure the new media into the IP subsystem.
Historically, attempts to provide improved replacements for TCP have failed, partly because new protocols are hard to deploy, and partly because some suggested improvements have simply been added as extensions to TCP or its implementations. Even so, the Tree and the Torus have properties that make it attractive to consider other transport protocols. Both networks have small payloads per packet (240 or 256 bytes), high speed, high reliability (low error rate and retransmission in hardware), and automatic flow control. (The Tree delivers packets in order, but the Torus does not.) Each network provides its own form of multicast, different from IP multicast. The Tree can do certain reduction operations in the network (combining results up the tree).
Despite the hardware flow control at the packet level, we still need flow control in the transport protocol: network access is multiplexed, and producer and consumer rates are not always matched. (Rate control in the protocol can be avoided by applying it at a higher level in the system.) On the other hand, the networks' highly reliable delivery might reduce and simplify error recovery. Unfortunately the error rate is not zero, so we must still allow for them (at some level, for many applications).
For several decades, protocols have been designed and sometimes implemented, to suit high-speed, highly-reliable networks. Perhaps something suitable exists, off-the-peg, or even off-the-wall. In the next post, I shall mention a few of them and sketch what I am currently implementing.