Tuesday, January 18, 2011

Hadoop cluster on EC2 using Cloudera distribution of Whirr

Motivation

There are a number of tutorials on getting Hadoop up on an EC2 cluster and then running Hadoop jobs on that cluster from a remote machine.  I ended up using Whirr from Cloudera CDH3 and worked through a number of websites and discussion groups, but I did not find a way to get everything up without a few headaches.  I thought it might be useful to post what worked for me and warn of some pitfalls along the way.  These instructions are for a local machine running Ubuntu 10.10.

Terminology

Most of the install will be done on your local machine, with a bit of testing on the name node of the Hadoop cluster running on EC2.  Shell commands executed on your local machine start with a dollar sign, whereas shell commands executed on the remote name node begin with a hash.

commands run on the local machine

$ command

commands run on name node of the cluster

# command

Pitfalls

Debian Packages are not worth it

The Debian packages for Cloudera, including Hadoop and Whirr, were built for Ubuntu 10.04.  Since I wanted to work with the latest release of Ubuntu, which at the time of this writing is 10.10, the Debian packages for Cloudera were more difficult to install than just grabbing the tarballs.

Hadoop versions on local machine and cluster must match

Hadoop must be the same version on the cluster and on the local machine.  The default Whirr instance at the time of writing runs Hadoop 0.20.2+737, so the 0.20.2+737 tarball must be used to run Hadoop jobs against the cluster provided by the Cloudera AMIs.  I am currently using Cloudera CDH3 Hadoop 0.20.2+737.
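
A quick way to check which build you have is the version subcommand.  Run locally (the path below assumes the ~/cdh3 tarball layout set up later in this post) it should report 0.20.2+737, and the same command can be run on the name node once the cluster is up:

$ hadoop/bin/hadoop version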

Deprecated Hadoop configuration scheme

Whirr uses a deprecated Hadoop configuration scheme.  It has not been an issue yet but it may be something to watch out for.

Installation Instructions

Install Open JDK

Install the package from the apt repositories

$ sudo apt-get install openjdk-6-jre 

Set the $JAVA_HOME environment variable (see elsewhere for instructions on permanently setting environment variables)

$ export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
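
If you just want a quick way to persist this (assuming bash as your shell), appending the export to ~/.bashrc works:

$ echo 'export JAVA_HOME=/usr/lib/jvm/java-6-openjdk' >> ~/.bashrc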

Set up SSH keys

Generate a keypair to connect to the cluster

$ ssh-keygen -t rsa
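
Whirr uses this key non-interactively when it brings up and connects to the nodes, so a passphrase-less key is the simplest choice here.  An explicit one-shot form (writing to the default ~/.ssh/id_rsa path referenced in the Whirr configuration below) would be:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa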

Get Hadoop tarball from Cloudera

Download the tarball here: Cloudera CDH3 Hadoop 0.20.2+737.

Install Hadoop locally

$ mkdir cdh3
$ cd cdh3
$ tar -xzvf ../Downloads/hadoop-0.20.2+737.tar.gz
$ ln -s hadoop-0.20.2+737/ hadoop

Feel free to add Hadoop (and later Whirr) to your PATH so you don't have to type the full paths to the hadoop and whirr commands in future sessions.
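
For example, assuming the ~/cdh3 layout above and a bash shell, something like this in the current session (or in ~/.bashrc) would do it:

$ export PATH=$PATH:$HOME/cdh3/hadoop/bin:$HOME/cdh3/whirr/bin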

Test hadoop locally

You should get a listing of your root file system when you run this command

$ hadoop/bin/hadoop fs -ls /

Get Whirr tarball from Cloudera

Download the tarball here: Cloudera Whirr 0.1.0+23.

Install Whirr locally

$ tar -xzvf ../Downloads/whirr-0.1.0+23.tar.gz
$ ln -s whirr-0.1.0+23/ whirr

Configure Whirr

Create ~/cdh3/whirr-config/whirr.cfg (making the whirr-config directory if it does not already exist) with the following contents:

whirr.service-name=hadoop
whirr.cluster-name=hadoopcluster
whirr.instance-templates=1 jt+nn,1 dn+tt
whirr.provider=ec2
whirr.identity=<Your AWS ACCESS KEY ID>
whirr.credential=<Your AWS SECRET ACCESS KEY>
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.hadoop-install-runurl=cloudera/cdh/install
whirr.hadoop-configure-runurl=cloudera/cdh/post-configure


This configuration sets up a cluster called hadoopcluster with one node running the job tracker and name node (jt+nn) and one node running a data node and task tracker (dn+tt), using your AWS credentials and the RSA keys generated earlier.
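
The whirr.instance-templates line is also where you would scale the cluster.  For example, keeping one jt+nn master but running five worker nodes would change only that line:

whirr.instance-templates=1 jt+nn,5 dn+tt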

Bring up an EC2 cluster using Whirr

$ whirr/bin/whirr launch-cluster --config whirr-config/whirr.cfg

This will take several minutes to complete.  When the cluster is up, connection info will be written out both to the screen and to a file for later use.  
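
Whirr writes this connection info under ~/.whirr/<cluster-name>/; with the configuration above you should find a proxy script and a generated Hadoop configuration in ~/.whirr/hadoopcluster/ (exact file names can vary between Whirr releases):

$ ls ~/.whirr/hadoopcluster/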

Start the Proxy to the cluster

$ sh ~/.whirr/hadoopcluster/hadoop-proxy.sh

First test: can you SSH to the name node?

If Whirr is able to bring up a cluster successfully, it will print out the public address of the name node.  Use this address and the SSH key you generated earlier to connect to the name node with a command like the following, substituting the address Whirr printed:

$ ssh -i ~/.ssh/id_rsa ec2-user@ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com

Second test: can you invoke Hadoop in an SSH session on the name node?

If you can ssh into the cluster, can you execute a hadoop command directly on the name node?

# hadoop fs -ls /

(This is the only command in this tutorial that is actually executed on the name node from within the cluster.  The rest of the tutorial is performed back on your local machine.)

Configure Hadoop to use the cluster

$ export HADOOP_CONF_DIR=$HOME/.whirr/hadoopcluster/

If the proxy you started earlier is no longer running, start it in another shell (and leave it open)

$ sh ~/.whirr/hadoopcluster/hadoop-proxy.sh

Minimally test hadoop on cluster from your local machine

$ hadoop/bin/hadoop fs -ls /

This is the exact command that you used to test whether Hadoop was working on your local machine.  This time, however, it runs against the EC2 cluster: Whirr wrote a Hadoop configuration that points at the cluster, and setting the $HADOOP_CONF_DIR environment variable tells Hadoop to use that configuration.

And this is it!  You can now run Hadoop jobs originating from your local machine on a Hadoop EC2 cluster.  You also have what you need to SSH or SCP to the name node in case you would like to work on the cluster directly or upload some data outside of Hadoop.  Whirr can also list the nodes that you have brought up and bring down the cluster when you are done.
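
As a final sanity check that jobs submitted from your local machine really run on EC2, you can try one of the example jobs bundled with the Hadoop tarball.  The jar name below is a guess based on the CDH3 tarball layout, so adjust the glob if your copy is named differently; the pi example estimates pi with 10 map tasks of 1000 samples each:

$ hadoop/bin/hadoop jar hadoop/hadoop-*examples*.jar pi 10 1000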

List the nodes in a cluster

$ whirr/bin/whirr list-cluster --config ~/cdh3/whirr-config/whirr.cfg

Bring down the cluster

Delete the cluster (and the EC2 security groups that Whirr created)

$ whirr/bin/whirr destroy-cluster --config ~/cdh3/whirr-config/whirr.cfg