Dumbo is a nifty Python package from the Audioscrobbler data crunchers at last.fm that lets you write Hadoop (Hadoop Streaming) jobs in Python. In this getting-started guide, we’ll install Cloudera’s distribution of Hadoop and Dumbo on Ubuntu, with minimal fuss. For more elaborate documentation, see the Cloudera documentation archives.
First, set up Cloudera’s apt repositories. They don’t have Karmic (9.10) repositories yet, but you can just use their Jaunty packages without problems:
sudo bash -c 'cat > /etc/apt/sources.list.d/cloudera.list' << EOF
deb http://archive.cloudera.com/debian jaunty-testing contrib
deb-src http://archive.cloudera.com/debian jaunty-testing contrib
EOF
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
sudo aptitude update
Now you can install the latest Hadoop (currently 0.20). You can get it configured for either:
- standalone mode, in which case it just uses your local (Linux) filesystem (handy if you’re just doing development work, since you don’t need to import/export things to/from HDFS), or
- pseudo-cluster (pseudo-distributed) mode, where all five daemons (NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker) run on your machine, but Hadoop otherwise uses the full distributed software stack (and hence a real HDFS instance).
To install in standalone mode:
sudo aptitude install hadoop-0.20
hadoop fs -ls /    # Notice it's your actual /
hadoop jar ...     # You can run jobs as a normal user.
To install in pseudo-cluster mode:
sudo aptitude install hadoop-0.20-conf-pseudo-desktop
# Start all five daemons.
for i in /etc/init.d/hadoop-0.20-* ; do sudo $i start ; done
sudo hadoop jar ...    # Running jobs requires root.
At this point, you can try running some of the example programs:
# The pi calculation example
hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
# The grep example (assuming you're in standalone mode)
mkdir /tmp/input
for i in aaa bbb ccc ; do echo $i > /tmp/input/$i ; done
hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar \
  grep /tmp/input /tmp/output '[a-b]+'
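As a sanity check on the grep example, here is a rough sketch in plain Python of what the job is looking for: every match of the regular expression in the input files, with a count per distinct match. It doesn’t use Hadoop at all; the `count_matches` function and the in-memory `inputs` list are just illustrative stand-ins for the job and for the three files created above.

```python
import re
from collections import Counter


def count_matches(texts, pattern):
    """Count every non-overlapping match of `pattern` across the given strings."""
    regex = re.compile(pattern)
    counts = Counter()
    for text in texts:
        # findall returns the matched substrings (the pattern has no groups).
        counts.update(regex.findall(text))
    return counts


# Stand-ins for the contents of /tmp/input/aaa, /tmp/input/bbb and /tmp/input/ccc.
inputs = ["aaa\n", "bbb\n", "ccc\n"]
counts = count_matches(inputs, r"[a-b]+")

for match, count in sorted(counts.items()):
    print("%d\t%s" % (count, match))
```

Only `aaa` and `bbb` match `[a-b]+`, so the files Hadoop writes under /tmp/output should report those two strings and nothing for `ccc`.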
Now install Dumbo:
easy_install dumbo
# Alternatively...
wget 'http://github.com/klbostee/dumbo/zipball/release-0.21.24'
unzip klbostee-dumbo-8806a2f.zip
cd klbostee-dumbo-8806a2f/
sudo python setup.py install
To run Dumbo’s word-count example:
# Run wordcount in standalone mode.
dumbo start wordcount.py -input brian.txt -output brianwc
# Run wordcount using Hadoop.
dumbo start wordcount.py -input brian.txt -output brianwc \
  -hadoop /usr/lib/hadoop
# Look at the output.
dumbo cat brianwc
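The commands above assume a wordcount.py in the current directory. In case you don’t have one handy, here is a minimal sketch of what such a script could look like, following the usual Dumbo pattern of a generator `mapper` and `reducer` handed to `dumbo.run`. The `run_locally` helper and the sample sentences are my own additions for trying the logic without Dumbo or Hadoop; they are not part of Dumbo.

```python
from itertools import groupby


def mapper(key, value):
    # `value` is one line of input text; `key` is unused here.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # `values` iterates over every count emitted for the word `key`.
    yield key, sum(values)


def run_locally(lines):
    """Hypothetical helper (not part of Dumbo): map, sort/group, reduce in memory."""
    pairs = sorted(
        pair for offset, line in enumerate(lines) for pair in mapper(offset, line)
    )
    results = []
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        results.extend(reducer(word, (count for _, count in group)))
    return results


if __name__ == "__main__":
    try:
        import dumbo
    except ImportError:
        # Dumbo isn't installed: just show the logic on two sample lines.
        sample = ["always look on the bright side", "the bright side of life"]
        for word, total in run_locally(sample):
            print("%s\t%d" % (word, total))
    else:
        # Hand the mapper and reducer to Dumbo, which wires them into the job.
        dumbo.run(mapper, reducer)
```

The sort-then-group step in `run_locally` mimics what happens between the map and reduce phases: all pairs for the same word end up next to each other, so the reducer sees each word exactly once.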