Author Archives: yang

256-color xterm

I somehow managed to never discover this until today. I was chatting with Ryan about myriad topics, from spatial programming abstractions to distributed system consistency, but by far the bit of information that would most immediately and dramatically change my life forever was xterm-256color. Simply export TERM=xterm-256color, and bam: you now have a rainbow of colors in Vim and Emacs. It should work right out of the box these days; it works for me in GNOME Terminal and PuTTY. Fantastic.
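
To check that the terminal emulator itself renders the full palette, a small script can paint all 256 background colors. This is only a sketch: it assumes the terminal understands the xterm-style `ESC[48;5;Nm` sequence, and the function names are made up for illustration.

```python
def palette_line(start, stop):
    """Return one line of colored cells for palette indices start..stop-1."""
    cells = []
    for n in range(start, stop):
        # ESC[48;5;Nm selects background color N from the 256-color palette.
        cells.append("\x1b[48;5;%dm%4d" % (n, n))
    # ESC[0m resets attributes so the color does not bleed past the line.
    return "".join(cells) + "\x1b[0m"


def palette(columns=16):
    """Return the full 256-color palette, `columns` cells per line."""
    return "\n".join(palette_line(i, min(i + columns, 256))
                     for i in range(0, 256, columns))


if __name__ == "__main__":
    print(palette())
```

If the higher-numbered cells come out blank or repeat the basic 16 colors, the emulator is probably not in 256-color mode.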

Cute silent failure building pycurl

I was trying to build and install pycurl, but it never actually installed properly. import pycurl just complained that the module was missing, and sure enough, I couldn’t find it anywhere, even though ./ install kept succeeding.

Peeking into an rpm for pycurl verified that was missing. So why wasn’t it getting installed? It wasn’t even in the build directory. It turns out that I had missed the key error message in the jumble below:


No-nonsense getting started with standalone Hadoop and Dumbo on Ubuntu

Dumbo is a nifty Python package from the Audioscrobbler data crunchers that lets you write Hadoop (Hadoop Streaming) jobs in Python. In this getting-started guide, we’ll install Cloudera’s distribution of Hadoop and Dumbo on Ubuntu, with minimal fuss. For more elaborate documentation, see the Cloudera documentation archives.


Making sense of OpenID, OAuth, OpenSocial, Google Friend Connect, Facebook Connect, and more

Last Thursday I dropped in to the Google SIPB hackathon, where I got a chance to chat with several Googlers in the Cambridge office about the whole ecosystem of decentralized identity and social networking services. I had actually previously spent a bit of time searching for a high-level map laying out how these various services related to each other, strictly out of curiosity, but never really found anything that was succinct, clear, and free of BS. There also seems to be a lot of contradictory information and general confusion. With the recent news that similar service stacks are expected from Twitter, it seems timely to share all the things I’ve been learning.

The executive summary:

  • OpenID: authentication; use one login across many sites
  • OpenID Attribute Exchange: a key-value store protocol for OpenID
  • OAuth: authorization and authentication; grant sites specific access to your account
  • OAuth WRAP: a simpler OAuth that leverages PKI instead of its own signature scheme
  • OpenSocial: a standard API into social networks
  • Google Friend Connect: an OpenSocial router, plus a bunch of other less-important stuff
  • Facebook Platform: all the above (and a bit more), for the Facebook stack
  • Facebook Connect: establish links between Facebook and third-party user account systems
  • Portable Contacts: just the slice of OpenSocial that deals with contacts


Bitten by Python scoping

Yet again, I wasted too many minutes staring at and debugging my Python code due to the language’s funky variable scoping:

def relevant(xs, y):
  "Return elements in xs that are relevant to y."
  pairs = ((x, relevance(x,y)) for x in xs)
  return [(x,y) for x,y in pairs if y > 0]

In this case, the y in the list comprehension modifies the binding used by the generator expression.
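
One way around this is to give the comprehension's loop variables names that cannot collide with the function's parameters. In the sketch below, `relevance` is a hypothetical stand-in scoring function, not the one from my code. (In Python 3, list comprehensions get their own scope, so the original would behave as intended there, but the distinct name is clearer either way.)

```python
def relevance(x, y):
    # Hypothetical scoring function, only for illustration:
    # the number of distinct characters x shares with y.
    return len(set(x) & set(y))


def relevant(xs, y):
    "Return (element, score) pairs for elements of xs that are relevant to y."
    # The generator is lazy: relevance(x, y) runs only when the list
    # comprehension below pulls items, so y must still mean the argument.
    pairs = ((x, relevance(x, y)) for x in xs)
    # A distinct name (score) means nothing rebinds y while pairs is consumed.
    return [(x, score) for x, score in pairs if score > 0]


print(relevant(["abc", "xyz", "cab"], "ab"))
```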

Web Sockets tutorial with simple Python server

The Landscape: HTML5

HTML5 is an emerging and in-flux client-side standard for developing web applications. It’s really more of a rich client platform specification than just a markup language, including the following slew of new features:

  • canvas for vector graphics
  • video and audio for multimedia
  • local offline storage
  • drag and drop operations
  • Web Socket API for bidirectional client-server communications
  • GeoLocation API
  • standard WYSIWYG HTML editor component
  • Web Workers API for message-passing processes
  • webcam and microphone access
  • 3D graphics rendering engine
  • and more…

A lot of this effort is about wrapping up and building into browsers native support for various (proprietary) technologies already in widespread use on the Web, such as Flash video/webcam/mic and Google Gears offline/drag-and-drop. Others are about cleaning up things for which there currently exist various hacks, and Web Sockets fall into this category.

Introducing Web Sockets

This Chromium blog post contains a nice introduction to Web Sockets:

The Web Sockets API enables web applications to handle bidirectional communications with server-side process in a straightforward way. Developers have been using XMLHttpRequest (“XHR”) for such purposes, but XHR makes developing web applications that communicate back and forth to the server unnecessarily complex. XHR is basically asynchronous HTTP, and because you need to use a tricky technique like long-hanging GET for sending data from the server to the browser, simple tasks rapidly become complex. As opposed to XMLHttpRequest, Web Sockets provide a real bidirectional communication channel in your browser. Once you get a Web Socket connection, you can send data from browser to server by calling a send() method, and receive data from server to browser by an onmessage event handler. A simple example is included below.

In addition to the new Web Sockets API, there is also a new protocol (the “web socket protocol”) that the browser uses to communicate with servers. The protocol is not raw TCP because it needs to provide the browser’s “same-origin” security model. It’s also not HTTP because web socket traffic differs from HTTP’s request-response model. Web socket communications using the new web socket protocol should use less bandwidth because, unlike a series of XHRs and hanging GETs, no headers are exchanged once the single connection has been established. To use this new API and protocol and take advantage of the simpler programming model and more efficient network traffic, you do need a new server implementation to communicate with, but don’t worry. We also developed pywebsocket, which can be used as an Apache extension module, or can even be run as standalone server.

(The mentioned technique of a long-hanging GET is also known as Comet.)

Chrome is presently the only browser that has Web Sockets, and only in the Dev channel releases. Firefox and Safari/WebKit support are under way, according to the implementation status page.

The Web Sockets Protocol

The protocol has the client and server do an HTTP-style handshake, where all text is in UTF-8 and newlines include a carriage return and newline. After this, arbitrary data can be sent back and forth, but delimited in frames, which begin with a 0x00 byte and end with a 0xff byte. Contrast this with the byte stream abstraction presented by raw TCP—having the system hand you whole frames frees the application from having to manually buffer and parse messages out of the stream (which the browser may be able to do more efficiently).
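
The framing described above is simple enough to sketch in a few lines of Python. This is a minimal illustration of the 0x00 ... 0xff text framing only, with function names of my own choosing. It relies on the fact that the byte 0xff never occurs in valid UTF-8, so it is safe as a delimiter.

```python
def encode_frame(text):
    """Wrap a text message in a frame: 0x00, UTF-8 payload, 0xff."""
    return b"\x00" + text.encode("utf-8") + b"\xff"


def decode_frames(buffer):
    """Split a byte buffer into complete messages plus leftover bytes.

    Returns (messages, rest), where rest holds any incomplete trailing frame
    and should be prepended to the next chunk read from the socket.
    """
    messages = []
    while True:
        end = buffer.find(b"\xff")
        if end == -1:
            # No complete frame left in the buffer.
            return messages, buffer
        if buffer[:1] != b"\x00":
            raise ValueError("frame does not start with 0x00")
        messages.append(buffer[1:end].decode("utf-8"))
        buffer = buffer[end + 1:]
```

This is the buffering and parsing work that an application on raw TCP would have to do itself.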

As for how this mixes with browser security policies, the basic gist is that the same-origin policy no longer applies. Requiring the Web Socket to communicate only with the same origin (same host and port as where the HTML/JavaScript came from) would be a barrier to deployment because it would require the httpd to additionally speak Web Sockets. (That said, the default port for Web Sockets is in fact port 80.) More generally, such a requirement would prevent all cross-site communication, which is critical to many classes of applications such as mash-ups and widget dashboards.

But the protocol does require the browser to send the origin information to the server, the server to validate this by echoing the origin, and finally the client to validate that the server echoed this. According to the protocol specs, the response must include the exact same origin, location, and protocol as the request, where:

  • the origin is just the (protocol, host, port) triplet,
  • the location is the target of the request, and
  • the protocol is an arbitrary string used to identify the exact application-level protocol expected.

(Note that the origin is different from the Referer header, which includes the full resource path and thus raises privacy concerns. I hope to write more soon on the Origin header in a broader context and on client-side web security in general.)
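
As a rough sketch of the server's half of this exchange, the following parses a handshake request, checks the origin against a whitelist, and echoes origin, location, and protocol back. The request header names (Host, Origin, WebSocket-Protocol) and the `allowed_origins` parameter are my assumptions for illustration. The sketch also assumes a plain ws: connection and is not a complete implementation of the spec.

```python
def parse_request(raw):
    """Parse a raw handshake request into (path, headers dict)."""
    head = raw.decode("utf-8").split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    method, path, _version = lines[0].split(" ", 2)
    if method != "GET":
        raise ValueError("handshake must be a GET request")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return path, headers


def build_response(raw, allowed_origins):
    """Validate the request's origin and echo origin, location, protocol."""
    path, headers = parse_request(raw)
    origin = headers.get("origin")
    if origin not in allowed_origins:
        raise ValueError("origin not allowed: %r" % (origin,))
    lines = [
        "HTTP/1.1 101 Web Socket Protocol Handshake",
        "Upgrade: WebSocket",
        "Connection: Upgrade",
        "WebSocket-Origin: " + origin,
        # Assumes an unencrypted ws: connection to the requested host.
        "WebSocket-Location: ws://" + headers["host"] + path,
    ]
    if "websocket-protocol" in headers:
        lines.append("WebSocket-Protocol: " + headers["websocket-protocol"])
    return ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")
```

The browser then compares the echoed values with what it sent and drops the connection if they differ.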

Example Client and Server

To give you a flavor of how to write a complete end-to-end web application using Web Sockets, the following is a simple client and server application where the server sends two messages down to the client, “hello” and “world.” This example is from my sandbox.

The client-side API for Web Sockets is very simple. The example client just connects to a server on port 9876 and alerts the user of each new message. Just to make this a wholesome HTML5 experience, we’ll write everything in XHTML5 (yes, there exists a stricter, XML flavor of HTML5 for those who preferred XHTML over HTML tag soup):

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Web Socket Example</title>
    <meta charset="UTF-8" />
    <script>
      window.onload = function() {
        var s = new WebSocket("ws://localhost:9876/");
        s.onopen = function(e) { alert("opened"); };
        s.onclose = function(e) { alert("closed"); };
        s.onmessage = function(e) { alert("got: " + e.data); };
      };
    </script>
  </head>
  <body>
    <div id="holder" style="width:600px; height:300px"></div>
  </body>
</html>

Now for the server, which is written in Python. It sends the two messages after a one-second delay before each. Note that the server is hard-coding the response to expect a locally connected client; this is for simplicity and clarity. In particular, it requires that the client is being served from localhost:8888. A real server would parse the request, validate it, and generate an appropriate response.

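#!/usr/bin/env python

import socket, threading, time

def handle(s):
  # Log the client's handshake request, then reply with a hard-coded handshake.
  print(repr(s.recv(4096)))
  s.send(b'''
HTTP/1.1 101 Web Socket Protocol Handshake\r
Upgrade: WebSocket\r
Connection: Upgrade\r
WebSocket-Origin: http://localhost:8888\r
WebSocket-Location: ws://localhost:9876/\r
WebSocket-Protocol: sample
  '''.strip() + b'\r\n\r\n')
  # Send each message as a frame (0x00, payload, 0xff) after a one-second delay.
  time.sleep(1)
  s.send(b'\x00hello\xff')
  time.sleep(1)
  s.send(b'\x00world\xff')
  s.close()

s = socket.socket()
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(('', 9876))
s.listen(1)
while 1:
  t,_ = s.accept()
  threading.Thread(target = handle, args = (t,)).start()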

To run the above, start the Web Socket server (./ and start a web server on port 8888 serving index.html:

./ &
python -m SimpleHTTPServer 8888

Further Exploration

For a more complete application (that’s still reasonably simple), I threw together a real-time X-Y scatter plotting application called Real-Time Plotter. It plots some number of data sources and supports streaming to multiple clients.

The Python server listens for data sources on port 9876. It expects a stream of text, where the first line is the name of the data source and each subsequent line contains a space-separated x-y pair of floating point numbers in the series to be plotted. It also listens on port 9877 for Web Socket clients. A simple data source that issues a random y-value per second can be started from the shell using netcat:

{
  echo 'my data source'
  i=0
  while true ; do
    echo "${i}000 $RANDOM"
    i=$((i+1))
    sleep 1
  done
} | nc localhost 9876
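
The same kind of data source can be sketched in Python. The function names here are hypothetical. The wire format (a name line followed by space-separated x-y lines) follows the description above, and the y range mimics bash's $RANDOM.

```python
import random
import socket
import time


def format_point(x, y):
    """Format one point as a space-separated x-y pair on its own line."""
    return ("%s %s\n" % (x, y)).encode("utf-8")


def send_series(sock, name, points, delay=0.0):
    """Send the source name, then each (x, y) point, over a connected socket."""
    sock.sendall(name.encode("utf-8") + b"\n")
    for x, y in points:
        sock.sendall(format_point(x, y))
        time.sleep(delay)


def random_points():
    """Yield (millisecond timestamp, random y) pairs forever."""
    i = 0
    while True:
        yield i * 1000, random.randint(0, 32767)
        i += 1
```

With the plotter's server listening locally, passing `socket.create_connection(("localhost", 9876))`, a source name, `random_points()`, and `delay=1.0` to `send_series` would behave much like the shell loop.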

The client page uses Web Sockets to connect to the server and fetch historical data, as well as start streaming new data. Plotting is done using Flot, a jQuery library for generating decent-looking plots. For throttling when the server is streaming new points quickly, the client only fetches new data (by sending an empty frame) after a complete redraw; the server responds by sending a batch of all new points since the last fetch. (Note: the server’s pump routine currently treats the x values as millisecond timestamps and only issues a single point per second, but this can be easily tweaked/removed.)
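
The batching half of this throttling scheme can be illustrated with a per-client cursor into an append-only list of points. This is a simplified sketch with made-up class names, not the actual Real-Time Plotter code.

```python
import threading


class Series:
    """Append-only list of (x, y) points shared by all clients."""

    def __init__(self):
        self._points = []
        self._lock = threading.Lock()

    def append(self, x, y):
        with self._lock:
            self._points.append((x, y))

    def since(self, index):
        """Return (points added at or after index, new end index)."""
        with self._lock:
            return self._points[index:], len(self._points)


class ClientCursor:
    """Tracks how much of a series one client has already received."""

    def __init__(self, series):
        self._series = series
        self._index = 0

    def fetch(self):
        """Called when the client sends an empty frame: return the new batch."""
        batch, self._index = self._series.since(self._index)
        return batch
```

A slow client simply gets larger batches less often, and a newly connected client's first fetch returns the full history.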

Web Sockets can also be used over TLS. This is done by using wss: instead of ws: in the URL, and this defaults to the HTTPS port 443.
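
For illustration, here is how a client library might resolve the endpoint from a Web Socket URL, using the default ports mentioned in this post (80 for ws:, 443 for wss:). The `endpoint` helper is hypothetical.

```python
from urllib.parse import urlsplit

# Default ports: ws: shares HTTP's port 80, wss: shares HTTPS's port 443.
DEFAULT_PORTS = {"ws": 80, "wss": 443}


def endpoint(url):
    """Return (host, port, use_tls) for a ws: or wss: URL."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError("not a Web Socket URL: %r" % (url,))
    port = parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port, parts.scheme == "wss"
```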

Recovering files using TestDisk

I recently had a hard disk go bad. Its single NTFS partition was no longer recognized by either Windows or Linux.

I managed to recover (I think) all of my files by reading the data off of the disk using TestDisk. It discovered my partition without issue, and the actual file recovery went relatively smoothly.

I believe TestDisk was primarily designed to work by actually repairing the partition. This should be fine if you’re working on a copy of the raw disk bits (you can probably use dd for Windows for this, or write a program that accesses ), but I’d generally be averse to making any further changes to a bad disk. I chose to recover my data just by copying the files directly out. You can do this as follows:

  • go to (your disk) > Intel > Analyse > (your dynamic partition)
  • highlight your partition
  • press P to list files
  • press H to hide deleted files
  • press C to copy the current directory (the directory from which you started TestDisk)

When using this approach, make sure you press H to hide deleted files before you copy anything. Otherwise, you end up recovering deleted files as well, and sometimes these deleted files (for whatever reason, probably because my data’s sufficiently corrupted) end up pointing back into the root of the file system, leading to cycles in the tree and thus an infinitely recursive file copying procedure.


A while back, I wrote about cooperative threading libraries for C++. Coroutines are a closely related concept—coroutines and cooperative threads can be expressed/implemented in terms of each other.

Conspicuously absent from that coverage is Boost.Coroutine, which I’ll discuss here. The problem with Boost.Coroutine is that it was, last I checked, far from complete. I had spent some time trying to work with the author through its non-starter issues, as I was looking forward to using it in conjunction with Boost.Asio (this was one of Boost.Coroutine’s primary objectives), but the author has not had the time to take his work to the Boost formal review stage.

Re-enabling desktop effects in Windows 7

Every once in a while, Windows 7’s Desktop Window Manager (DWM) spontaneously goes funky on me and loses its effects (transparency, blurring, shadows).

When this happens, you can get your effects back by going to Services and restarting the “Desktop Window Manager Session Manager” service, or by going to an Administrative command prompt and running:

net stop uxsms
net start uxsms