Saturday 5 April 2008

Protocols and Plan 9 on Blue Gene

A small group of us is developing new small-scale systems software for large-scale supercomputers, such as the IBM Blue Gene series, based on the distributed operating system Plan 9 from Bell Labs.

Blue Gene has groups of up to 64 CPU nodes, each with several processors, connected to the outside world through an IO node per group. The network provision is unusual (although that itself is not unusual in the supercomputer world). Only the IO node has a conventional Ethernet. The CPU nodes are typically connected in a 2D or 3D structure by a special Torus network. CPU nodes within each group are connected to each other and to the IO node for the group by a "class routed" network, commonly called the Tree, although routing tables can create other topologies.
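
To fix the picture, here is a rough sketch of how a node in such a machine might be addressed (hypothetical: the names and fields are mine for illustration, not Blue Gene's actual interfaces):

	/* Hypothetical sketch of per-node addressing; names and sizes are illustrative only. */
	typedef struct Node Node;
	struct Node {
		int	x, y, z;	/* coordinates on the Torus */
		int	treeaddr;	/* address on the Tree (class-routed) network */
		int	pset;		/* which group of CPU nodes, served by one IO node */
		int	isio;		/* nonzero if this is the group's IO node */
	};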

The Ethernet is nothing special and we simply use the existing Internet protocols. Just to get started, we also run IP over the Tree and the Torus, with small MTUs. Given Plan 9's straightforward Medium structure in its Internet stack, it was short work to add Tree and Torus medium drivers. A kernel process reads the raw device (/dev/torus or /dev/vc0), strips the medium header, and passes the resulting Block up the stack; in the other direction, the medium driver adds the tree or torus driver's header to an IP packet and writes it to the device. A few lines of shell script configure the new media into the IP subsystem.
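
To make that structure concrete, here is a user-level sketch of the read side of such a medium driver. The header size and the use of print are stand-ins for illustration; the real kernel process defines its own header layout and passes the resulting Block up the IP stack.

	#include <u.h>
	#include <libc.h>

	/*
	 * Sketch only: read a raw packet from the Torus device, strip an
	 * assumed medium header, and hand the remaining IP packet on.
	 */
	enum {
		Hdrsize	= 16,	/* assumed Torus medium header size (illustrative) */
		Maxpkt	= 256,	/* small hardware payload */
	};

	static void
	passup(uchar *pkt, int len)
	{
		/* stand-in for handing the packet to the IP stack */
		print("ip packet, %d bytes, version %d\n", len, pkt[0]>>4);
	}

	void
	main(void)
	{
		int fd, n;
		uchar buf[Maxpkt];

		fd = open("/dev/torus", OREAD);
		if(fd < 0)
			sysfatal("open /dev/torus: %r");
		for(;;){
			n = read(fd, buf, sizeof buf);
			if(n < 0)
				sysfatal("read /dev/torus: %r");
			if(n <= Hdrsize)
				continue;
			passup(buf+Hdrsize, n-Hdrsize);
		}
	}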

Historically, attempts to provide improved replacements for TCP have failed, partly because new protocols are hard to deploy, and partly because some suggested improvements have simply been added as extensions to TCP or its implementations.
Even so, the Tree and the Torus have properties that make it attractive to consider other transport protocols. Both networks have small payloads per packet (240 or 256 bytes), high speed, high reliability (low error rate and retransmission in hardware), and automatic flow control. (The Tree delivers packets in order, but the Torus does not.) Each network provides its own form of multicast, different from IP multicast. The Tree can do certain reduction operations in the network (combining results up the tree).
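
A back-of-envelope calculation shows why the small payload matters when IP runs directly over these links. The 20-byte IPv4 header is standard; the 240-byte figure is the payload quoted above, taken here as the MTU for the sake of arithmetic, ignoring any transport header.

	#include <u.h>
	#include <libc.h>

	void
	main(void)
	{
		int payload = 240;		/* Torus packet payload, used as the MTU here */
		int iphdr = 20;			/* IPv4 header, no options */
		int perpkt = payload - iphdr;	/* user data carried per packet */
		int msg = 64*1024;		/* an example 64KB application message */

		print("usable data per packet: %d bytes\n", perpkt);
		print("packets for a %d-byte message: %d\n",
			msg, (msg + perpkt - 1) / perpkt);
		exits(nil);
	}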

Existing applications commonly use the Message Passing Interface (MPI) to exchange data and control computations. Much of MPI is supported at the library level, but we can make good use of suitable primitives at the transport level. We also hope to provide an alternative native messaging interface simpler than MPI's. Our system applications will use Plan 9's 9P protocol extensively to represent and share services and resources. It seems we shall need both connectionless and connection-oriented communication.
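
To give a flavour of what a simpler native interface might look like, here is a purely hypothetical sketch; the names and signatures are invented for illustration, not a design we have settled on.

	#include <u.h>
	#include <libc.h>

	/* Hypothetical messaging interface, much smaller than MPI. */
	typedef struct Msg Msg;
	struct Msg {
		int	src;	/* sending node */
		int	tag;	/* small integer separating message streams */
		long	len;
		uchar	*data;
	};

	int	msgsend(int dst, int tag, void *data, long len);	/* send to one node */
	int	msgbcast(int tag, void *data, long len);		/* use the hardware multicast */
	Msg*	msgrecv(int src, int tag);				/* wait for a matching message */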

The supercomputer environment makes large-scale experiments in protocol design and implementation easier than on a normal LAN or the Internet. For instance, there are hundreds of nodes, but we can readily ensure that all of them are running the same implementation of the same version of the protocol, and we can change the protocol(s) without too much fuss. Being able to inter-operate directly with other systems is not a key requirement (working through a protocol bridge is fine for now).

Despite the hardware flow control at the packet level, we still need flow control in the transport protocol:
network access is multiplexed, and producer and consumer rates are not always matched. (Rate control in the protocol can be avoided by applying it at a higher level in the system.) On the other hand, the networks' highly reliable delivery might reduce and simplify error recovery. Unfortunately the error rate is not zero, so we must still allow for errors (at some level, for many applications).
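
As an illustration of the sort of transport-level flow control that might suffice, here is a minimal credit-based sketch using Plan 9's qlock(2) primitives. The scheme, names, and the assumption of a single sender per flow are illustrative, not a protocol we have adopted.

	#include <u.h>
	#include <libc.h>

	/*
	 * Credit-based flow control sketch: the receiver grants the sender
	 * a number of packet credits; the sender takes one per packet and
	 * blocks when none remain; the receiver returns credits as its
	 * consumer drains the queue.
	 */
	typedef struct Flow Flow;
	struct Flow {
		QLock	lk;
		Rendez	r;
		int	credits;
	};

	void
	flowinit(Flow *f, int credits)
	{
		memset(f, 0, sizeof *f);
		f->r.l = &f->lk;
		f->credits = credits;
	}

	void
	flowsend(Flow *f)	/* sender: take one credit, blocking if none */
	{
		qlock(&f->lk);
		while(f->credits == 0)
			rsleep(&f->r);
		f->credits--;
		qunlock(&f->lk);
	}

	void
	flowgrant(Flow *f, int n)	/* receiver: return n credits */
	{
		qlock(&f->lk);
		f->credits += n;
		rwakeup(&f->r);
		qunlock(&f->lk);
	}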

For several decades, protocols have been designed, and sometimes implemented, to suit high-speed, highly reliable networks. Perhaps something suitable exists, off-the-peg, or even off-the-wall. In the next post, I shall mention a few of them and sketch what I am currently implementing.
