Monthly Archives: March 2009

Cooperative threads for C/C++

Event-based programming is a PITA. In C/C++, there are a number of libraries that provide cooperative threads (a.k.a. fibers or lightweight threads or…), which can make single-kernel-threaded asynchronous programming less painful.

I’ve been using the solid, fast, and simple State Threads library. I haven’t used Pth, nor do I have a good sense of the performance difference, but that appears to be the only other option worth considering. There exist other libraries, and you can find links to most of them from the Pth links page, but they aren’t as useful out of the box. To mention a few:

  • libcoro, libcoroutine, and libpcl (which supersedes coro) don’t provide IO, only basic coroutine primitives, and don’t appear to be widley used or supported.
  • libtask provides IO, but the implementation uses the slower ucontext.h primitives, the IO functions aren’t very rich (they don’t appear to offer timeouts), and again the library doesn’t appear to be widely used or supported.

Pitfalls: tooling, composition, memory

Using something like this in your next project is always a bit of a risk. Your code will almost certainly be more natural to express and easier to maintain, but you may encounter incompatibilities with other systems/libraries/tools. For instance, one major issue is that ST is (or at least has been at some point in the past) incompatible with system threads. I haven’t stress-tested this, but mixing the two hasn’t produced any problems for me yet. (The problem, if you’re curious, appears to be due to both pthreads and ST attempting to use the same location to store thread-local data.) Other tools that I’ve found to not play nicely with ST include Valgrind and various CPU profilers (gprof, Google perftools, etc.). gdb works fine, though.

A downside to using cooperative threads instead of events is that you can’t compose together multiple asynchronous operations without using multiple stacks. E.g., if I wanted to listen for the first of n sockets to yield something to read, and I wanted to do this using only the provided primitive IO functions, I’d need to perform a read from each of n threads (and have the winner send interrupts to the losers). To do this more efficiently, I’d need to write my own IO functions, ones that cannot be expressed using the threaded abstraction, but must instead fall back to raw asynchronous operations.

In practice, I haven’t had to do much arbitrary composition. Composing IO operations with sleeps (in other words, adding timeouts to IO operations) is important, but that tends to be built in to the IO functions already provided by the threading library. Composing together inter-thread signals and IO operations is also important (e.g. for interrupts), but interrupts also tend to be already provided by the library.

Another downside is that events may be more scalable/flexible in terms of memory consumption—this is one of the main reasons why (system) threads are regarded as generally less scalable than events. The stack requires contiguous memory, whereas event state objects are dynamically heap-allocated, so you needn’t to worry about providing enough room in advance for each “stack.” Conservative over-estimates occupy more space than necessary, so when juggling hundreds of thousands of connections, you may exhaust main memory and incur excessive swapping (thrashing).

Variations: Tamer, Capriccio

Tamer is a preprocessor that lets you annotate functions that are expected to perform asynchronous operations as “tamed”. You also annotate which “stack” variables are used across these calls. Tamer transforms these tamed functions into objects with the appropriate callbacks, and it changes the “stack” variables into fields so that they persist across callbacks.

The Tamer approach lets you compose events without the stack expense, but every time you enter/exit a tamed function, a state object is dynamically allocated/freed. For another approach, Capriccio is a system from Berkeley that introduces a simple idea: use linked stacks, or dynamically growable stacks. I hope we eventually see this concept used in a maintained system that actually builds and is usable on modern platforms.

Also, Tamer’s annotation requirement is both a blessing and a curse: annotations clearly identify at what parts of your program you may be preempted, but of course this reduces flexibility, since you can’t (say) embed an asynchronous operation into certain places, such as inside an streambuf::underflow() (this is the method that pulls more data from the streambuf’s underlying stream source).

In an ideal world, you won’t need to annotate functions, but the fact that a function is asynchronous should be inferred by tools (starting with a root set of primitive asynchronous operations), so that your IDE can highlight which functions are asynchronous. This grants you the freedom to change what’s asynchronous and what isn’t without propagating lots of annotation changes throughout your code, but still remain cognizant of where you might be preempted and where you definitely won’t. This would only work given enough static assumptions, though—hiding asynchronous IO operations behind any form of dynamic dispatch/polymorphism would preclude an analysis that is both accurate and useful.

C++ physical design

A quote from a thread on the Boost user list describing C++ physical design as a trade-off against other, better-established qualities of code:

Whether or not precompiled headers lead to faster builds is besides the point, because they increase coupling.

The point of good physical design is to reduce coupling. Reduced coupling has many benefits beyond its effect on build times, in fact it often leads to slower “full rebuilds”. Also, good physical design sometimes requires compromises with logical design, type safety, or even performance.

The problem is that bad physical design has no measurable effects on anything until it’s too late. Once it becomes a problem, reducing physical coupling is very difficult. Boost with its header only nature is beyond that point, IMO.

Emil Dotchevski
Reverge Studios, Inc.


When working in C++, I occasionally find myself pining for a REPL like those for Python/Haskell/Scheme/[your language here]. Given the size and complexity of the language, I find it useful at times to explore certain nooks and crannies, and REPLs can be a quick way to try things out. It turns out that such a beast exists, and its name is CINT.

The craziest part about CINT is that this is a stand-alone interpreter, i.e. a full implementation of C++. Well, not quite full, but it implements a large subset of C++. CINT also implements a subset of system libraries from POSIX, Win32, and even the STL, but certainly not everything/with limitations. If you’re curious, check out the speeds on the Computer Language Benchmarks Game. This thing was clearly not built for speed.

Putting aside the sheer coolness factor, I haven’t actually used CINT much, and part of this is because (I believe) C++’s design is not amenable to interactive use. CINT does add a few language extensions which make it slightly more usable, including limited “type inference” and looser distinguishing of . from ->. However, the C++ test cases I’ve been coming up with are almost always at least several lines of code consisting of some function/class/template declarations—clumsier to input using readline.

There also exists a REPL for C: Evan Martin’s C-REPL. Unlike CINT, it’s a simpler program that uses incremental compilation tricks. Basically, every time you enter a line, C-REPL generates a source file, which then gets compiled into a shared library, which in turn is dynamically linked with and executed. As such, C-REPL supports the full language, system libraries, anything. And it’s fast. But it only works with C.

Lazy C++

Ever been fed up with having to separate declaration from definition in C++? Annoyed by the need to forward-declare your types/variables/functions, or at least write them in dependency-dictated order? Sick of keeping header files in sync with the source? With Lazy C++, you can leave your frustrations behind! Simply plop both your declarations and your definitions into a single file, and Lazy C++ will automagically tease the code apart into a standard header file and source file. The website and documentation provide self-explanatory examples:

For example, given the following code:

// A.lzz
class A
  inline void f (int i) { ... }
  void g (int j = 0) { ... }
bool operator == (A const & a1, A const & a2) { ... }

Lzz will generate a header file:

// A.h
#ifndef LZZ_A_h
#define LZZ_A_h

class A
  void f (int i);
  void g (int j = 0);
inline void A::f (int i) { ... }
bool operator == (A const & a1, A const & a2);

And a source file:

// A.cpp
#include "A.h"

void A::g (int j) { ... }
bool operator == (A const & a1, A const & a2) { ... }

Lazy C++ is a simple idea, but its implementation is at best tedious, requiring the parsing of a non-trivial subset of C++. It gets things right, though. It knows, for instance, to place inlines and templates into headers, to move method and function bodies out of the header when possible, and to restrict static declarations to the source. It also has escape hatches to let you directly place code into the header or source (this is particularly useful when using macros like unit test definitions). It tries to leave function bodies unparsed, which provides flexibility in allowing for C++ language extensions (such as clamp’s lambda expressions).

D—whose primary implementation recently had its source released (albeit not fully free nor open-source)—takes things a step further:

Features To Drop [from C/C++]

  • Forward declarations. C compilers semantically only know about what has lexically preceded the current state. C++ extends this a little, in that class members can rely on forward referenced class members. D takes this to its logical conclusion, forward declarations are no longer necessary at the module level. Functions can be defined in a natural order rather than the typical inside-out order commonly used in C programs to avoid writing forward declarations.
  • Include files. A major cause of slow compiles as each compilation unit must reparse enormous quantities of header files. Include files should be done as importing a symbol table.

While I’m on it: D is a pretty interesting language to follow, for a number of reasons. It has the feel of being a sensible evolution and clean-up of C++, and it’s the only thing out there besides C/C++ that one can pick up and start using right away as a systems programming language with (accessible) manual memory management. D is actually being used/gathering libraries within a growing community. Also, the language has some pretty good people working on its design, including C++ experts like Andrei Alexandrescu.

The C++ Lambda Preprocessor

Tired of waiting around for anonymous closures to arrive in C++0x (or at least in GCC)? Boost.Lambda and other hacks leaving you pining for something with free variable capture? Get them today with the amazing C++ Lambda Preprocessor (“clamp”)!

Here’s a little demo illustrating the use of lambdas with threads—something which may be more interesting to systems programmers than the simple accumulator examples on the clamp homepage.

#include <boost/thread.hpp>
#include <iostream>

using namespace boost;
using namespace std;
#include "lambda_impl.clamp_h"

enum { n = 3 };
int main() {
  const string msgs[n] = {"first", "second", "third"};
  mutex m;
  for (int i = 0; i < n; ++i) {
    boost::thread(lambda() {
      mutex::scoped_lock l(__ref(m));
      cout << "message from thread " << __ctx(i)
           << ": " << __ref(msgs[i]) << endl;
  return 0;

// clamp < | sed '1d' >
// g++  -lboost_thread-gcc43-mt -o clamp2
// ./clamp2

There are a few annoyances with clamp:

  • All free variables must be explicitly captured with __ref() or __ctx(). This is because clamp doesn’t try to actually parse your C++ (understandably). (Not parsing the C++ also makes it more flexible, accommodating various language extensions.) Contrast this with C++0x, where you can use short-hand notations like [&], which captures all free variables by reference (perhaps the most common case).
  • You will most likely need to muck with the output. clamp generates a number of header files that form the first #include in your source file. However, if you ever refer to global types/variables/functions, you will need to reorder things, add forward declarations, separate the generated header contents into declarations and definitions, etc. This is made nearly trivial, however, if you also use Lazy C++, which I’ll cover next time.

Despite these minor quibbles, clamp is simple and effective. The generated code is straightforward—pretty much exactly what you would expect to write yourself by hand.

unique_ptr and nullptr for C++03

(I’ve been using C++ a bunch lately, so the next several posts will probably be pretty C++-centric.)

unique_ptr is the future. And the future is now! Start using unique_ptr today with this little gem, a C++03-compatible implementation. Of course, you get fancy new C++0x move semantics, pioneered by auto_ptr, but now safer, saner, more widely usable (e.g. in containers), etc. unique_ptr also supports arrays without a separate unique_array class: unique_ptr<T[]> is template-specialized. (I won’t spend this post describing unique_ptr; the provided links already do a much better job at that.)

Some more opinion from the C++03-compatible unique_ptr page:

A const unique_ptr may be considered a better scoped_ptr than even boost::scoped_ptr. The only way to transfer ownership away from a const unique_ptr is by using const_cast. Unlike scoped_ptr, you can’t even swap const unique_ptr’s. The overhead is the same (if using the default_delete or an empty custom deleter – sizeof(unique_ptr<T>) == sizeof(scoped_ptr<T>)). And custom deleters are not even a possibility with scoped_ptr.

(Note BTW that boost::interprocess::unique_ptr is not the same as the C++0x unique_ptr.)

In general, you can already achieve move semantics without rvalue-references, just by using const_cast (or mutable). This is indeed how this unique_ptr implementation works. The core idea illustrated:

template<typename T>
class moveable_ptr {
  T *p;

  // Disable copying from lvalues.
  moveable_ptr(moveable_ptr<T> &o) : p(nullptr) {}

  moveable_ptr(T *p) : p(p) {}

  // In C++0x:
  // moveable_ptr(moveable_ptr<T> &&o) : p(o.p) {
  //   o.p = nullptr;
  // }

  // In C++03, settle with:
  moveable_ptr(const moveable_ptr<T> &o) : p(o.p) {
    const_cast<T*>(o.p) = nullptr;

  ~moveable_ptr() { if (p) delete p; }
  T *get() const { return p; }

Notice above that I use nullptr instead of 0 or NULL—that’s another C++0x feature you can begin using right away, and without any hackery at all.

These goodies are all collected in my growing C++ Commons library.