Cooperative threads for C/C++

Event-based programming is a PITA. In C/C++, there are a number of libraries that provide cooperative threads (a.k.a. fibers or lightweight threads or…), which can make single-kernel-threaded asynchronous programming less painful.

I’ve been using the solid, fast, and simple State Threads library. I haven’t used Pth, nor do I have a good sense of the performance difference, but that appears to be the only other option worth considering. There exist other libraries, and you can find links to most of them from the Pth links page, but they aren’t as useful out of the box. To mention a few:

  • libcoro, libcoroutine, and libpcl (which supersedes coro) provide only basic coroutine primitives, no IO, and don’t appear to be widely used or supported.
  • libtask provides IO, but the implementation uses the slower ucontext.h primitives, the IO functions aren’t very rich (they don’t appear to offer timeouts), and again the library doesn’t appear to be widely used or supported.

Pitfalls: tooling, composition, memory

Using something like this in your next project is always a bit of a risk. Your code will almost certainly be more natural to express and easier to maintain, but you may encounter incompatibilities with other systems/libraries/tools. For instance, one major issue is that ST is (or at least has been at some point in the past) incompatible with system threads. I haven’t stress-tested this, but mixing the two hasn’t produced any problems for me yet. (The problem, if you’re curious, appears to be that both pthreads and ST attempt to use the same location to store thread-local data.) Other tools that I’ve found not to play nicely with ST include Valgrind and various CPU profilers (gprof, Google perftools, etc.). gdb works fine, though.

A downside to using cooperative threads instead of events is that you can’t compose together multiple asynchronous operations without using multiple stacks. E.g., if I wanted to listen for the first of n sockets to yield something to read, and I wanted to do this using only the provided primitive IO functions, I’d need to perform a read from each of n threads (and have the winner send interrupts to the losers). To do this more efficiently, I’d need to write my own IO functions, ones that cannot be expressed using the threaded abstraction, but must instead fall back to raw asynchronous operations.

In practice, I haven’t had to do much arbitrary composition. Composing IO operations with sleeps (in other words, adding timeouts to IO operations) is important, but that tends to be built in to the IO functions already provided by the threading library. Composing together inter-thread signals and IO operations is also important (e.g. for interrupts), but interrupts also tend to be already provided by the library.

Another downside is that events may be more scalable/flexible in terms of memory consumption—this is one of the main reasons why (system) threads are regarded as generally less scalable than events. A stack requires contiguous memory reserved in advance, whereas event state objects are dynamically heap-allocated, so with events you needn’t worry about providing enough room up front for each “stack.” With threads, conservative over-estimates occupy more space than necessary, so when juggling hundreds of thousands of connections, you may exhaust main memory and incur excessive swapping (thrashing).

Variations: Tamer, Capriccio

Tamer is a preprocessor that lets you annotate functions that are expected to perform asynchronous operations as “tamed”. You also annotate which “stack” variables are used across these calls. Tamer transforms these tamed functions into objects with the appropriate callbacks, and it changes the “stack” variables into fields so that they persist across callbacks.

The Tamer approach lets you compose events without the stack expense, but every time you enter/exit a tamed function, a state object is dynamically allocated/freed. For another approach, Capriccio is a system from Berkeley that introduces a simple idea: use linked stacks, or dynamically growable stacks. I hope we eventually see this concept used in a maintained system that actually builds and is usable on modern platforms.

Also, Tamer’s annotation requirement is both a blessing and a curse: annotations clearly identify at what parts of your program you may be preempted, but of course this reduces flexibility, since you can’t (say) embed an asynchronous operation into certain places, such as inside a streambuf::underflow() (this is the method that pulls more data from the streambuf’s underlying stream source).

In an ideal world, you wouldn’t need to annotate functions; instead, whether a function is asynchronous would be inferred by tools (starting from a root set of primitive asynchronous operations), so that your IDE could highlight which functions are asynchronous. This grants you the freedom to change what’s asynchronous and what isn’t without propagating lots of annotation changes throughout your code, while still remaining cognizant of where you might be preempted and where you definitely won’t be. This would only work given enough static assumptions, though—hiding asynchronous IO operations behind any form of dynamic dispatch/polymorphism would preclude an analysis that is both accurate and useful.

Follow me on Twitter for stuff far more interesting than what I blog.