Too Many NumLumps

Friday, May 13, 2011

Another Clojure snippet

In my continuing adventure learning Clojure, I've put together a bit of code that takes a collection of strings and filters any that are contained in another string.

To whoever is interested:
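For readers who don't speak Clojure, here is a rough Python sketch of the same idea. It is not the original snippet: the name drop_contained is made up for illustration, and it assumes that "contained in another string" means "is a substring of some other string in the same collection".

```python
def drop_contained(strings):
    """Return the strings that are not substrings of any other string given."""
    # De-duplicate first (keeping first-seen order) so identical strings
    # don't knock each other out.
    unique = list(dict.fromkeys(strings))
    return [
        s for s in unique
        if not any(s != other and s in other for other in unique)
    ]


print(drop_contained(["car", "carpet", "pet", "dog"]))  # -> ['carpet', 'dog']
```

This compares every pair of strings, so it is quadratic in the size of the collection, which is fine for small inputs.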

Saturday, May 7, 2011

Once upon a time my frequent and unlucky defeats while playing Risk led me to build a super simple Monte Carlo-style simulator for Risk battles. It didn't help me win any more often, but at least I could complain about my losses by showing just how unlikely they were. It was a consolation prize of sorts.

Recently I've been learning Clojure with the intent of trying out Cascalog for simplifying Hadoop workflows. It's been ages since I've programmed in a functional language, so I wanted a bite-sized project to help get comfy with Clojure. The Risk battle simulator seemed like a good fit - so here it is, rebuilt with Clojure-y goodness.

Like I said, I'm new to Clojure, so don't assume any of that is idiomatic. Nonetheless, the code is pretty concise compared to my Java implementation and the slick Incanter library made it easy to chart the distribution of outcomes for a battle.

Running "(view-outcomes {:attackers 10 :defenders 10} 1000000)" will compute and chart the outcomes of a million 10 vs. 10 Risk battles. The magnitude on the outcome axis indicates how many armies are left; negative values mean the defender won and positive values mean the attacker won.
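A minimal Python sketch of this kind of simulation follows. It is not the Clojure code described above. It assumes the standard dice rules (the attacker rolls up to three dice, the defender up to two, the highest dice are compared pairwise, and ties go to the defender) and that every listed army can fight. The names battle and simulate are hypothetical, and it tallies outcomes instead of charting them.

```python
import random
from collections import Counter


def battle(attackers, defenders, rng=random):
    """Fight one battle to the end.

    Returns +N if the attacker wins with N armies left,
    or -N if the defender wins with N armies left.
    """
    while attackers > 0 and defenders > 0:
        a_dice = sorted((rng.randint(1, 6) for _ in range(min(3, attackers))), reverse=True)
        d_dice = sorted((rng.randint(1, 6) for _ in range(min(2, defenders))), reverse=True)
        # Compare highest vs. highest, then second highest vs. second highest.
        for a, d in zip(a_dice, d_dice):
            if a > d:
                defenders -= 1
            else:  # ties go to the defender
                attackers -= 1
    return attackers if attackers > 0 else -defenders


def simulate(attackers, defenders, trials, seed=None):
    """Tally the outcomes of many independent battles."""
    rng = random.Random(seed)
    return Counter(battle(attackers, defenders, rng) for _ in range(trials))


outcomes = simulate(10, 10, 10_000, seed=42)
attacker_wins = sum(count for result, count in outcomes.items() if result > 0)
print(f"attacker won {attacker_wins / 10_000:.1%} of battles")
```

The signed return value mirrors the chart's convention, so a histogram of the Counter keys gives the same kind of picture.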

So the next time your 20 armies are destroyed by 12 measly defenders, you can point at this chart and complain with accuracy:

Saturday, March 5, 2011

Decision Trees & Hadoop - Part 3: Results

Part 1: Data | Part 2: Approach | Part 3: Results

And now the (exciting?) conclusion of my adventures building a C4.5-like decision tree with Hadoop. Since my previous post I've implemented the design, added a few new wrinkles, and collected the results using my Mandelbrot toy problem.

I'm skipping the low level details, but the "new wrinkles" are two additional map-reduce tasks.

One of the tasks triggers whenever the number of instances in a tree node drops below a predefined threshold. The threshold is set low enough that we know the instances can fit into a single process's memory. So we go ahead and grow the rest of that tree branch in the standard recursive manner, greatly reducing the time it takes to build the entire tree.

The second task simply searches for instances that belong to leaves of the tree (nodes we no longer consider for splitting) and removes them. This reduces the amount of data we need to evaluate in future iterations.
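Stripped of the map-reduce plumbing, the two wrinkles boil down to something like the Python sketch below. The function names, node IDs, and data layout are all hypothetical and are not taken from the actual project.

```python
def partition_nodes(instance_counts, threshold):
    """Split open tree nodes by how many instances reach them.

    Nodes below the threshold are small enough to finish in memory with
    ordinary recursive tree growing; the rest need another distributed pass.
    """
    small = [node for node, count in instance_counts.items() if count < threshold]
    large = [node for node, count in instance_counts.items() if count >= threshold]
    return small, large


def drop_leaf_instances(instances, leaf_nodes):
    """Keep only the instances whose current node is still open for splitting."""
    return [(node, row) for node, row in instances if node not in leaf_nodes]


small, large = partition_nodes({"root/L": 120, "root/R": 5_000_000}, threshold=10_000)
print(small, large)  # -> ['root/L'] ['root/R']
```

Both steps shrink the work left for later iterations: the first by finishing whole branches at once, the second by discarding rows that can no longer affect any split.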

I ran the Hadoop tree algorithm on the toy dataset I described in part 1 and compared it to the RepTree and RandomForest techniques from Weka's collection. RepTree and the RandomForest outperformed HadoopTree for 10K, 100K, and 1M datasets, but failed to build on the 10M dataset (using a 2 GB JVM). The HadoopTree trained on the 10M dataset had the best overall accuracy.

To give the HadoopTree fair competition I added 11 bagged RepTrees (bagging being a more traditional way to tackle giant datasets). The bagged RepTree's performance was nearly an exact match of the HadoopTree's results.
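For anyone who hasn't met bagging: each tree is trained on a bootstrap sample (drawn with replacement) and the trees vote on the prediction. A minimal Python sketch, with hypothetical names and plain callables standing in for trained trees (this is not Weka's API):

```python
import random
from collections import Counter


def bootstrap_sample(rows, size, rng):
    """Draw `size` rows with replacement from a non-empty list of rows."""
    return [rng.choice(rows) for _ in range(size)]


def bagged_predict(models, x):
    """Return the majority vote of the models' predictions for x."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Three stand-in "trees" that threshold a single number differently.
models = [lambda x: x > 0.4, lambda x: x > 0.5, lambda x: x > 0.6]
print(bagged_predict(models, 0.55))  # -> True
```

With two classes, an odd number of voters can never tie. Because each tree only ever sees a sample, no single tree has to hold the full dataset in memory.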

In the second round of tests I used Mandelbrot data with 50 dummy numeric features. This made the dataset much larger and RepTree and RandomForest failed to build on both the 1M and 10M datasets. Random splits don't do well with a large number of junk features so, not surprisingly, the RandomForest's performance suffered on this dataset.

The HadoopTree trained on 10M instances carries the day, but a bag of 101 RepTrees comes in a close second. Considering that bagged smaller trees are much faster to grow and would do better on noisy data, the single HadoopTree is more a novelty than a practical tool. Although I'm not implementing it at the moment, I expect this technique would provide a bigger payoff if it were modified to grow boosted regression trees.

See the full results here.

The code is viewable on GitHub, but be warned, it's still a toy project. I haven't serialized the output from the map-reduce tasks (so there's a lot more data transfer than there should be) or made a proper parameter/config file.

Tuesday, January 18, 2011

Hadoop cluster on EC2 using Cloudera distribution of Whirr

Motivation

A number of tutorials have been posted on getting Hadoop up on an EC2 cluster and then running Hadoop jobs on that cluster from a remote machine. I ended up using Whirr from Cloudera CDH3 after going through a number of websites and discussion groups, but so far I have not found a way to get everything up without a few headaches. I thought it might be useful to post what worked for me and warn of some pitfalls along the way. These instructions are for a local machine running Ubuntu 10.10.

Terminology

Most of the install will be done on your local machine with a bit of testing on the name node of the Hadoop cluster running on EC2. Shell commands executed on your local machine will start with a dollar sign whereas shell commands executed on the remote name node will begin with a hash.

Commands run on the local machine

$ command

Commands run on the name node of the cluster

# command

Pitfalls

Debian Packages are not worth it

The Debian packages for Cloudera, including Hadoop and Whirr, were built for Ubuntu 10.04. Since I wanted to work with the latest release of Ubuntu, which at the time of this writing is 10.10, the Debian packages were more difficult to install than just grabbing the tarballs.

Hadoop versions on local machine and cluster must match

Hadoop must be the same version on the cluster and on the local machine. The default Whirr instance at the time of writing has Hadoop 0.20.2+737, so the 0.20.2+737 tarball must be used locally to run Hadoop jobs on the Cloudera AMI-based cluster. I am currently using Cloudera CDH3 Hadoop 0.20.2+737.
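One way to catch a mismatch early is to compare the output of "hadoop version" on both sides. The Python sketch below assumes the first line of that output has the form "Hadoop <version>". The helper names are hypothetical, and how you collect the two outputs (for example, by pasting them in) is up to you.

```python
def hadoop_version(version_output):
    """Return the version token from the first line of `hadoop version` output.

    Assumes the first line looks like "Hadoop 0.20.2+737".
    """
    lines = version_output.strip().splitlines()
    if not lines:
        raise ValueError("empty `hadoop version` output")
    return lines[0].split()[-1]


def versions_match(local_output, cluster_output):
    """True when both outputs report the same Hadoop version string."""
    return hadoop_version(local_output) == hadoop_version(cluster_output)


print(versions_match("Hadoop 0.20.2+737\n", "Hadoop 0.20.2+737\n"))  # -> True
```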

Deprecated Hadoop configuration scheme

Whirr uses a deprecated Hadoop configuration scheme. It has not been an issue yet but it may be something to watch out for.

Installation Instructions

Install Open JDK

Install the package from the apt repositories

$ sudo apt-get install openjdk-6-jre

Set the $JAVA_HOME environment variable (see elsewhere for instructions on permanently setting environment variables)

$ export JAVA_HOME=/usr/lib/jvm/java-6-openjdk

Set up SSH keys

Generate a keypair to connect to the cluster

$ ssh-keygen -t rsa

Get Hadoop tarball from Cloudera

Download the tarball here: Cloudera CDH3 Hadoop 0.20.2+737.

Install Hadoop locally

$ mkdir cdh3
$ cd cdh3
$ tar -xzvf ../Downloads/hadoop-0.20.2+737.tar.gz
$ ln -s hadoop-0.20.2+737/ hadoop

Feel free to add Hadoop (and later Whirr) to your path so you don't have to specify the whole path with Whirr and Hadoop in future sessions.

Test Hadoop locally

You should get a listing of your root file system when you run this command:

$ hadoop/bin/hadoop fs -ls /

Get Whirr tarball from Cloudera

Download the tarball here: Cloudera Whirr 0.1.0+23.

Install Whirr locally

$ tar -xzvf ../Downloads/whirr-0.1.0+23.tar.gz
$ ln -s whirr-0.1.0+23/ whirr

Configure Whirr

Create or edit ~/cdh3/whirr-config/whirr.cfg so that it contains the following:

whirr.service-name=hadoop

whirr.cluster-name=hadoopcluster

whirr.instance-templates=1 jt+nn,1 dn+tt

whirr.provider=ec2

whirr.identity=<Your AWS ACCESS KEY ID>

whirr.credential=<Your AWS SECRET ACCESS KEY>

whirr.private-key-file=${sys:user.home}/.ssh/id_rsa

whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub

whirr.hadoop-install-runurl=cloudera/cdh/install

whirr.hadoop-configure-runurl=cloudera/cdh/post-configure

This configuration sets up a cluster called hadoopcluster with two nodes: one acting as job tracker and name node (jt+nn) and one acting as data node and task tracker (dn+tt). It uses your AWS credentials and the RSA keys generated earlier.

Bring up an EC2 cluster using Whirr

$ whirr/bin/whirr launch-cluster --config whirr-config/whirr.cfg

This will take several minutes to complete. When the cluster is up, connection info will be written out both to the screen and to a file for later use.

Start the proxy to the cluster

$ ~/.whirr/hadoopcluster/hadoop-proxy.sh

First test: can you SSH to the name node?

If Whirr is able to bring up a cluster successfully, it will print out the public address of the name node. Use this address and the SSH key you generated to connect to the name node with a command similar to the following, substituting the address Whirr printed for the host name:

$ ssh -i ~/.ssh/id_rsa ec2-user@ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com

Second test: can you invoke Hadoop in an SSH session on the name node?

Once you are logged in, check that you can run a Hadoop command on the name node itself:

# hadoop fs -ls /

(This is the only command that will be executed on the name node itself. The rest of the tutorial is performed back on your local machine.)

Configure Hadoop to use the cluster

$ export HADOOP_CONF_DIR=$HOME/.whirr/hadoopcluster/

Start the proxy in another shell (and leave it open)

$ sh ~/.whirr/hadoopcluster/hadoop-proxy.sh

Minimally test Hadoop on the cluster from your local machine

$ hadoop/bin/hadoop fs -ls /

This is the exact command that you used to test whether Hadoop was working on your local machine. However, Whirr wrote out a Hadoop configuration that points at the EC2 cluster, and by setting the $HADOOP_CONF_DIR environment variable we tell Hadoop to use it, so the listing now comes from the cluster's file system rather than your local one.

And this is it! You can now run Hadoop jobs originating from your local machine on a Hadoop EC2 cluster. You also have what you need to SSH or SCP to the name node in case you would like to work on the cluster directly or upload some data outside of Hadoop. Whirr can also list the nodes in a cluster and bring down the clusters that you have brought up.

List the nodes in a cluster

$ whirr/bin/whirr list-cluster --config ~/cdh3/whirr-config/whirr.cfg

Bring down the cluster

Delete the cluster (and EC2 security roles)

$ whirr/bin/whirr destroy-cluster --config ~/cdh3/whirr-config/whirr.cfg