No-nonsense getting started with standalone Hadoop and Dumbo on Ubuntu

Dumbo is a nifty Python package from the Audioscrobbler data crunchers at Last.fm that lets you write Hadoop jobs (via Hadoop Streaming) in Python. In this getting-started guide, we’ll install Cloudera’s distribution of Hadoop and Dumbo on Ubuntu, with minimal fuss. For more elaborate documentation, see the Cloudera documentation archives.

Installing Hadoop

First, set up Cloudera’s apt repositories. They don’t have Karmic (9.10) repositories yet, but you can just use their Jaunty packages without problems:

sudo bash -c 'cat > /etc/apt/sources.list.d/cloudera.list' << EOF
deb http://archive.cloudera.com/debian jaunty-testing contrib
deb-src http://archive.cloudera.com/debian jaunty-testing contrib
EOF

curl -s http://archive.cloudera.com/debian/archive.key |
    sudo apt-key add -

sudo aptitude update

Now you can install the latest Hadoop (currently 0.20). You can get it configured for either:

  • standalone mode, in which case it just uses your local (Linux) filesystem—handy if you’re just doing development work, since you don’t need to import/export things to/from HDFS—or
  • pseudo-cluster mode, where all five node types run locally, but otherwise uses the full distributed software stack (and hence a real HDFS instance).

To install in standalone mode:

sudo aptitude install hadoop-0.20
hadoop fs -ls / # Notice it's your actual /
hadoop jar ... # You can run jobs as a normal user.

To install in pseudo-cluster mode:

sudo aptitude install hadoop-0.20-conf-pseudo
# Start all five daemons.
for i in /etc/init.d/hadoop-0.20-* ; do sudo $i start ; done
sudo hadoop jar ... # Running jobs requires root.
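
Either way, you can confirm which mode a given installation is actually in by poking at the active configuration. This is a rough check, not gospel: depending on the Cloudera release, the config directory may be /etc/hadoop-0.20/conf rather than /etc/hadoop/conf.

# Standalone mode lists your real local root here; pseudo-cluster mode
# lists the (initially empty) root of HDFS instead.
hadoop fs -ls /

# fs.default.name is file:/// (or simply absent, which means the same)
# in standalone mode, and an hdfs://localhost:... URL in pseudo-cluster mode.
grep -A 1 fs.default.name /etc/hadoop/conf/core-site.xml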

At this point, you can try running some of the example programs:

# The pi calculation example
hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000

# The grep example (assuming you're in standalone mode)
mkdir /tmp/input
for i in aaa bbb ccc ; do echo $i > /tmp/input/$i ; done
hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar \
    grep /tmp/input /tmp/output '[a-b]+'
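
Since standalone mode writes straight to the local filesystem, you can inspect the grep result with ordinary tools. Roughly, you should see something like the following (ccc doesn't match the pattern, so only the other two lines are counted):

cat /tmp/output/part-*
# 1       aaa
# 1       bbb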

Installing Dumbo

Dumbo normally requires a few patches to be applied to Hadoop 0.20, but Cloudera’s latest distribution of Hadoop 0.20 already includes them.

Installing Dumbo works just like any other Python setuptools package, and is most easily done with easy_install or pip:

easy_install dumbo

# Alternatively...
wget 'http://github.com/klbostee/dumbo/zipball/release-0.21.24'
unzip release-0.21.24  # extracts to klbostee-dumbo-8806a2f/
cd klbostee-dumbo-8806a2f/
sudo python setup.py install
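
The wordcount.py script used below ships with Dumbo (look in its examples/ directory); a minimal version, essentially the one from the Dumbo documentation, looks like this:

def mapper(key, value):
    # value is one line of input text; the key can be ignored for word counting.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values iterates over all the counts emitted for this word.
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)

You can also pass combiner=reducer to dumbo.run to cut down on shuffle traffic.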

To run Dumbo’s word-count example:

# Run wordcount locally, without Hadoop (plain Python).
dumbo start wordcount.py -input brian.txt -output brianwc

# Run wordcount using Hadoop.
dumbo start wordcount.py -input brian.txt -output brianwc \
    -hadoop /usr/lib/hadoop

# Look at the output.
dumbo cat brianwc

Follow me on Twitter for stuff far more interesting than what I blog.

  • Jun Xu

    The output brianwc, when using Hadoop, is not a file; instead it's a directory. Could you give some explanation?

  • yaaang

    brianwc is a directory if you're using HDFS, and a file if you're using the local filesystem. It's a directory in the former case because multiple result files go in there, one per reducer.
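
    For example, with the pseudo-cluster setup above (where the output lives in HDFS), you can inspect it like this:

        hadoop fs -ls brianwc                        # one part-NNNNN file per reducer
        hadoop fs -cat brianwc/part-*                # concatenate them by hand
        dumbo cat brianwc -hadoop /usr/lib/hadoop    # or let dumbo concatenate for you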
