A number of tutorials have been posted on bringing Hadoop up on an EC2 cluster and then running Hadoop jobs on that cluster from a remote machine. I ended up using Whirr from Cloudera CDH3 and worked through a number of websites and discussion groups along the way, but I never found a single guide that got everything running without a few headaches. I thought it might be useful to post what worked for me and warn of some pitfalls along the way. These instructions are for a local machine running Ubuntu 10.10.
Terminology
Most of the install will be done on your local machine with a bit of testing on the name node of the Hadoop cluster running on EC2. Shell commands executed on your local machine will start with a dollar sign whereas shell commands executed on the remote name node will begin with a hash.
commands run on the local machine
$ command
commands run on name node of the cluster
# command
Pitfalls
Debian Packages are not worth it
The Debian packages for Cloudera, including Hadoop and Whirr, were built for Ubuntu 10.04. Since I wanted to work with the latest release of Ubuntu, which at the time of this writing is 10.10, the Debian packages for Cloudera were more trouble to install than just grabbing the tarballs.
Hadoop versions on local machine and cluster must match
Hadoop must be the same version on the cluster and on the local machine. The cluster that Whirr brings up by default runs Hadoop 0.20.2+737 at the time of writing, so the 0.20.2+737 tarball must be used locally in order to run Hadoop jobs against the Cloudera AMI-based cluster. I am currently using Cloudera CDH3 Hadoop 0.20.2+737.
Deprecated Hadoop configuration scheme
Whirr uses a deprecated Hadoop configuration scheme. It has not been an issue yet but it may be something to watch out for.
Installation Instructions
Install Open JDK
Install the package from the apt repositories
$ sudo apt-get install openjdk-6-jre
Set the $JAVA_HOME environment variable (one way to make this permanent is shown just below)
$ export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
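To make this persist across sessions, one option is to append the export to your ~/.bashrc (the path below matches the OpenJDK package installed above; adjust it if your JDK lives elsewhere):
$ echo 'export JAVA_HOME=/usr/lib/jvm/java-6-openjdk' >> ~/.bashrc
$ source ~/.bashrc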
Set up SSH keys
Generate a keypair to connect to the cluster
$ ssh-keygen -t rsa
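Whirr will use this keypair to authorize logins to the cluster nodes, so confirm that both halves are where the Whirr configuration below expects them (the default ~/.ssh locations are assumed here):
$ ls ~/.ssh/id_rsa ~/.ssh/id_rsa.pub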
Get Hadoop tarball from Cloudera
Download the tarball here: Cloudera CDH3 Hadoop 0.20.2+737.
Install Hadoop locally
$ mkdir cdh3
$ cd cdh3
$ tar -xzvf ../Downloads/hadoop-0.20.2+737.tar.gz
$ ln -s hadoop-0.20.2+737/ hadoop
Feel free to add Hadoop (and later Whirr) to your PATH so you don't have to type the full paths to the hadoop and whirr commands in future sessions; one way to do this is sketched below.
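For example, appending something like the following to your ~/.bashrc would do it (the ~/cdh3 location is simply where the tarballs were unpacked in this tutorial):
$ echo 'export PATH=$PATH:$HOME/cdh3/hadoop/bin:$HOME/cdh3/whirr/bin' >> ~/.bashrc
$ source ~/.bashrc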
Test hadoop locally
Because no cluster configuration has been set yet, Hadoop defaults to the local file system, so this command should give a listing of your root file system
$ hadoop/bin/hadoop fs -ls /
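Given the version-matching pitfall above, it is also worth checking which build you are running; this should report the 0.20.2+737 build:
$ hadoop/bin/hadoop version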
Get Whirr tarball from Cloudera
Download the tarball here: Cloudera Whirr 0.1.0+23.
Install Whirr locally
$ tar -xzvf ../Downloads/whirr-0.1.0+23.tar.gz
$ ln -s whirr-0.1.0+23/ whirr
Configure Whirr
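The whirr-config directory does not exist yet, so create it first (keeping it under ~/cdh3 is just the convention used in this tutorial):
$ mkdir -p ~/cdh3/whirr-config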
Edit ~/cdh3/whirr-config/whirr.cfg
whirr.service-name=hadoop
whirr.cluster-name=hadoopcluster
whirr.instance-templates=1 jt+nn,1 dn+tt
whirr.provider=ec2
whirr.identity=<Your AWS ACCESS KEY ID>
whirr.credential=<Your AWS SECRET ACCESS KEY>
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.hadoop-install-runurl=cloudera/cdh/install
whirr.hadoop-configure-runurl=cloudera/cdh/post-configure
This configuration sets up a cluster named hadoopcluster with one node acting as job tracker and name node (jt+nn) and one node acting as data node and task tracker (dn+tt), using your AWS credentials and the RSA keypair generated earlier.
Bring up an EC2 cluster using Whirr
$ whirr/bin/whirr launch-cluster --config whirr-config/whirr.cfg
This will take several minutes to complete. When the cluster is up, connection info will be written out both to the screen and to a file for later use.
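The generated files land under ~/.whirr/<cluster-name>/, so you can peek at what was written; you should see hadoop-proxy.sh along with the Hadoop configuration Whirr generated (likely a single hadoop-site.xml, which is the deprecated configuration scheme mentioned in the pitfalls above):
$ ls ~/.whirr/hadoopcluster/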
Start the Proxy to the cluster
$ ~/.whirr/hadoopcluster/hadoop-proxy.sh
First test: can you SSH to the name node?
If Whirr is able to bring up a cluster successfully, it will print out the public address of the name node. Use this address and the SSH key you generated to connect to the name node via ssh, with a command similar to the following, substituting the address Whirr printed:
$ ssh -i ~/.ssh/id_rsa ec2-user@ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com
Second test: can you invoke Hadoop in an SSH session on the name node?
If you can ssh into the cluster, can you execute a hadoop command locally on the cluster?
# hadoop fs -ls /
(This is the only command in this tutorial that is actually executed on the name node itself. The rest of the tutorial is performed back on your local machine.)
Configure Hadoop to use the cluster
$ export HADOOP_CONF_DIR=$HOME/.whirr/hadoopcluster/
Start the proxy in another shell and leave it open (skip this if the proxy from the earlier step is still running)
$ sh ~/.whirr/hadoopcluster/hadoop-proxy.sh
Minimally test hadoop on cluster from your local machine
$ hadoop/bin/hadoop fs -ls /
This is the exact command you used to test whether Hadoop was working on your local machine. The difference is that Whirr wrote a Hadoop configuration pointing at the EC2 cluster, and by setting the $HADOOP_CONF_DIR environment variable you pointed Hadoop at that configuration.
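As a slightly larger smoke test, you can submit one of the example jobs bundled with the Hadoop tarball. The exact jar name varies between builds, so adjust the glob if needed; this runs the pi estimator with 2 maps of 1000 samples each:
$ hadoop/bin/hadoop jar hadoop/hadoop-*examples*.jar pi 2 1000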
And that is it! You can now run Hadoop jobs on an EC2 Hadoop cluster from your local machine. You also have what you need to SSH or SCP to the name node in case you would like to work on the cluster directly or upload some data outside of Hadoop. Whirr can also list the nodes that you have brought up and bring down the clusters when you are done.
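For example, to copy a file up to the name node (the file name is just a placeholder; the host is the same address you used for the SSH test):
$ scp -i ~/.ssh/id_rsa mydata.txt ec2-user@ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:~/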
List the nodes in a cluster
$ whirr/bin/whirr list-cluster --config ~/cdh3/whirr-config/whirr.cfg
Bring down the cluster
Delete the cluster (and its EC2 security groups)
$ whirr/bin/whirr destroy-cluster --config ~/cdh3/whirr-config/whirr.cfg