Software Magpie: April 2008

Wednesday, 30 April 2008

Bottled it

Shopping yesterday, I compared prices of bottled water at my local Sainsbury's (a supermarket chain in the United Kingdom). The Highland Spring was exceptionally troublesome. I took a snapshot with my mobile phone. I wonder what the shelf-fillers made of it.

Note: the suffix p is "pence" (plural of penny)

Sunday, 27 April 2008

Reading List

Here are a few things I've read recently, apart from papers and reports:

Press On: Principles of Interaction Programming, by Harold Thimbleby (devices often have frustrating user interfaces, but we can apply intelligent computer science informed by context to get much better ones, and he provides some programs to do it; although the book is not as beautiful as one of Edward Tufte's, there is a similar intellectual range and excitement, including relevant historical analogies, and the importance of context to any design)
Chris Okasaki's Publications (data structures and algorithms for functional programming languages)
Software Engineering for Internet Applications, which is online, although I bought the paperback (only 2006, and the technology is already outdated, but the book is still worthwhile for its extensive design discussions and revelations about the worrying low-level stuff underpinning web applications, and of course the technology is still in use)
Beyond the Desktop Metaphor (as it suggests, looking for alternatives to the over-present desktop, but I didn't find it had much fizz, compared to Press On, and although there were a few ideas worth applying, I thought most of the results were dull compared say to just using Acme or an iPhone)

Sunday, 6 April 2008

A Fistful of Protocols

With my able assistants, clever Google and slightly less reliable Memory (with a "y"), I considered existing protocols designed for high-speed, highly-reliable networks, including URP (indirectly), NETBLT, LNTP, VMTP, SNR, and XTP, roughly in that order; and read previous reviews of them. I also looked at Plan 9's early Nonet and later IL/IP protocols, RFC1077, SCP, SCMP, and a few "one-sided communication" protocols. (I haven't got a specification for Nonet but most of the code is online.) The MPI specification defines a programming interface, not a protocol, and so is not directly useful, but there are papers about using one-sided protocols to implement MPI.

Most of these protocols were designed to include high-speed wide-area internets, except Nonet and IBM's one-sided protocol with no name, which I shall call Clint. Clint was designed specifically for the Blue Gene environment. As noted earlier, that environment does not eliminate failures such as loss or data corruption, but greatly reduces their likelihood. For instance, the network is local area only, will not impose arbitrary delays on packets, and will not drop packets because some intervening node is suddenly congested (as might happen on the Internet). It might make a mistake, but rarely. That should allow us to make a few simplifying assumptions, but our project's requirements for fault-tolerance still require us to account for uncommon errors.

The Clint paper shows that per-packet header overhead matters, because the payload is small: 248 bytes on the Torus, after allowing for an 8-byte header read by the hardware. The hardware likes data on 16-byte boundaries, so an extra 8-byte software header fits well. That is too small for a complete message context, so Clint transmits a larger header containing a full context in every initial message until the first response from the receiver provides its own unique identifier for that context, reducing the header to the near minimal 8 bytes in all subsequent messages. The scheme saves the latency of a round trip to establish that context first, and avoids wasting bandwidth on big headers in all packets.
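To make the arithmetic concrete, here is a rough sketch of that header scheme. It is not Clint's code. The 256-byte packet size is inferred from the figures above (248 plus the 8-byte hardware header), and `FULL_HEADER`, `ContextSender` and `on_response` are hypothetical names and sizes chosen only for illustration.

```python
PACKET_SIZE = 256   # assumed: 248-byte payload plus the 8-byte hardware header
HW_HEADER = 8       # read by the hardware (figure from the text)
SHORT_HEADER = 8    # near-minimal software header (figure from the text)
FULL_HEADER = 40    # hypothetical size of a header carrying a full context


def payload_room(software_header):
    """Bytes left for user data in one packet."""
    return PACKET_SIZE - HW_HEADER - software_header


class ContextSender:
    """Sends full headers until the receiver names the context."""

    def __init__(self):
        self.peer_id = None  # receiver's identifier, unknown at first

    def header_size(self):
        # Full context until the receiver has answered, short header after.
        return FULL_HEADER if self.peer_id is None else SHORT_HEADER

    def on_response(self, peer_id):
        # The first response carries the receiver's identifier for the context.
        self.peer_id = peer_id

    def room(self):
        return payload_room(self.header_size())
```

With these assumed sizes, each packet carries 208 bytes of data before the first response and 240 afterwards, and no round trip is spent setting up the context.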

What do we need from our transport protocol? We certainly need virtual circuits for 9P connections, and reliable messaging for MPI-like things. Of the existing protocols, XTP looked the most promising: it offers fast setup of contexts for message sending, including context key exchange (with a similar effect to Clint). It follows URP in making the receiver reactive, which allows it to support both datagrams and connections in the same structure by having a transmitter drive the receiver appropriately. The receiver state is transmitted only on demand, and promptly, removing the need for most timers at the receiver.
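The reactive-receiver idea can be sketched as follows. This is only an illustration of the principle, not XTP's packet format or state machine: the `want_status` flag, the `Packet` fields and the shape of the status report are all assumptions of mine.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int
    data: bytes
    want_status: bool = False  # hypothetical "please report your state" flag


class ReactiveReceiver:
    """Keeps no timers: it reports state only when the transmitter asks."""

    def __init__(self):
        self.received = {}  # seq -> data

    def handle(self, pkt):
        self.received[pkt.seq] = pkt.data
        if not pkt.want_status:
            return None          # stay silent, as for a plain datagram
        return self.status()     # reply promptly, on demand

    def status(self):
        next_expected = 0
        while next_expected in self.received:
            next_expected += 1
        return {"next_expected": next_expected,
                "highest_seen": max(self.received)}
```

A transmitter that never sets the flag gets datagram behaviour; one that sets it on chosen packets gets the acknowledgements a connection needs, from the same receiver code.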

I am currently implementing a simplified variant of XTP, with a nod or two to Nonet. The packet formats are different, with a view towards a more subtle implementation along x-kernel lines. Some aspects of XTP are replaced by other mechanisms. I think rate-based control is better done at the application level, for instance, and we might do something a little different for multicast, given the special forms of multicast supported by the Tree and Torus. Timers can be rather sloppy and statically-defined, because errors will typically arise only because a node crashes or hangs, or someone is careless with the wire cutters. Out-of-order delivery will occur on the Torus and must not cause needless retransmissions, but how many packet frames might elapse before an obviously missing packet is declared lost? Apart from that, the reliability of the networks allows an early driver to assume error-free networks, with later support for retransmission and other error recovery.
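One way to frame the question about out-of-order delivery is as a reordering threshold: a missing packet is declared lost only once a packet numbered at least `threshold` beyond it has arrived. The sketch below is an assumption about how that might look, not the implementation described here; `ReorderTracker` and its threshold are hypothetical, and choosing a good value is exactly the open question.

```python
class ReorderTracker:
    """Declares a gap lost only after enough later packets have arrived."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical tuning knob
        self.next_expected = 0      # lowest sequence number not yet received
        self.pending = set()        # received out of order, above next_expected
        self.highest = -1           # highest sequence number seen so far
        self.reported = set()       # gaps already declared lost

    def receive(self, seq):
        """Record one arrival; return sequence numbers newly declared lost."""
        if seq < self.next_expected or seq in self.pending:
            return []               # duplicate, nothing to do
        self.reported.discard(seq)  # a late or retransmitted packet fills its gap
        self.pending.add(seq)
        self.highest = max(self.highest, seq)
        while self.next_expected in self.pending:
            self.pending.remove(self.next_expected)
            self.next_expected += 1
        limit = self.highest - self.threshold + 1
        lost = [s for s in range(self.next_expected, limit)
                if s not in self.pending and s not in self.reported]
        self.reported.update(lost)
        return lost
```

With a threshold of 3, packets arriving as 0, 2, 1 cause no report at all, while a gap at 3 is reported only when packet 6 turns up.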

Saturday, 5 April 2008

Protocols and Plan 9 on Blue Gene

A small group of us is developing new small-scale systems software for large-scale supercomputers, such as the IBM Blue Gene series, based on the distributed operating system Plan 9 from Bell Labs.

Blue Gene has groups of up to 64 CPU nodes, each with several processors, connected to the outside world through an IO node per group. The network provision is unusual (although that itself is not unusual in the supercomputer world). Only the IO node has a conventional Ethernet. The CPU nodes are typically connected in a 2D or 3D structure by a special Torus network. CPU nodes within each group are connected to each other and to the IO node for the group by a "class routed" network, commonly called the Tree, although routing tables can create other topologies.

The Ethernet is nothing special and we simply use the existing Internet protocols. Just to get started, we also run IP over the Tree and the Torus, with small MTUs. Given Plan 9's straightforward Medium structure in its Internet stack, it was short work to add Tree and Torus medium drivers. A kernel process reads the raw device (/dev/torus or /dev/vc0), strips the medium header, and passes the resulting Block up the stack; in the other direction, the medium driver adds the tree or torus driver's header to an IP packet and writes it to the device. A few lines of shell script configure the new media into the IP subsystem.
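The two directions of such a medium driver can be sketched in a few lines. This is Python rather than Plan 9 kernel C, and the 8-byte header layout (destination coordinates, a protocol tag, a length and padding) is entirely hypothetical; the real Tree and Torus headers are not described in this post.

```python
import struct

# Hypothetical 8-byte medium header: x, y, z destination coordinates,
# a protocol tag, the payload length, and two pad bytes.
HEADER = struct.Struct("!BBBBHxx")
PROTO_IP = 1  # hypothetical tag marking an IP payload


def add_header(ip_packet, dest):
    """Outbound: prefix an IP packet with the medium header."""
    x, y, z = dest
    return HEADER.pack(x, y, z, PROTO_IP, len(ip_packet)) + ip_packet


def strip_header(frame):
    """Inbound: check the medium header and return the IP packet."""
    x, y, z, proto, length = HEADER.unpack_from(frame)
    if proto != PROTO_IP:
        raise ValueError("not an IP frame")
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return payload
```

Carrying an explicit length in the header lets the inbound side discard any padding the hardware adds to fill out a packet.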

Historically, attempts to provide improved replacements for TCP have failed, partly because new protocols are hard to deploy, and partly because some suggested improvements have simply been added as extensions to TCP or its implementations. Even so, the Tree and the Torus have properties that make it attractive to consider other transport protocols. Both networks have small payloads per packet (240 or 256 bytes), high speed, high reliability (low error rate and retransmission in hardware), and automatic flow control. (The Tree delivers packets in order, but the Torus does not.) Each network provides its own form of multicast, different from IP multicast. The Tree can do certain reduction operations in the network (combining results up the tree).

Existing applications commonly use the Message Passing Interface (MPI) to exchange data and control computations. Much of it is supported at the library level, but we can make good use of suitable primitives at the transport level. We also hope to provide an alternative native messaging interface simpler than MPI's. Our system applications will be using Plan 9's 9P protocol extensively to represent and share services and resources. It seems we shall need both connectionless and connection-oriented communication.

The supercomputer environment makes large-scale experiments in protocol design and implementation easier than on a normal LAN or the Internet. For instance, there are hundreds of nodes, but we can readily ensure that all of them are running the same implementation of the same version of the protocol, and we can change the protocol(s) without too much fuss. Being able to inter-operate directly with other systems is not a key requirement (working through a protocol bridge is fine for now).

Despite the hardware flow control at the packet level, we still need flow control in the transport protocol: network access is multiplexed, and producer and consumer rates are not always matched. (Rate control in the protocol can be avoided by applying it at a higher level in the system.) On the other hand, the networks' highly reliable delivery might reduce and simplify error recovery. Unfortunately the error rate is not zero, so we must still allow for errors (at some level, for many applications).
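One common way to match producer and consumer rates is credit-based flow control, where the receiver grants the sender permission for a number of messages at a time. The post does not say which mechanism the protocol will use, so the sketch below is only an example of the general idea, and `CreditSender` is a hypothetical name.

```python
from collections import deque


class CreditSender:
    """Sends only while it holds credits granted by the receiver."""

    def __init__(self, credits):
        self.credits = credits  # messages the receiver can currently absorb
        self.queue = deque()    # messages waiting for credit

    def submit(self, msg):
        """Queue a message; return whatever can go on the wire now."""
        self.queue.append(msg)
        return self.pump()

    def grant(self, n):
        """The receiver has freed n buffers; return newly sendable messages."""
        self.credits += n
        return self.pump()

    def pump(self):
        sent = []
        while self.queue and self.credits > 0:
            sent.append(self.queue.popleft())
            self.credits -= 1
        return sent
```

A fast producer simply finds its messages queued until the consumer catches up and grants more credit.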

For several decades, protocols have been designed, and sometimes implemented, to suit high-speed, highly-reliable networks. Perhaps something suitable exists, off-the-peg, or even off-the-wall. In the next post, I shall mention a few of them and sketch what I am currently implementing.
