Tuesday, December 21, 2010

Decision Trees & Hadoop - Part 1: Data

Part 1: Data | Part 2: Approach | Part 3: Results

-----
Update:  Looks like some folks at Google have built a similar system (albeit much more polished).  You can check it out here.
-----

Recently I've been exploring Hadoop and some of its satellite projects. MapReduce is a powerful paradigm for tackling huge datasets and I wanted a project to help me ease into it all. Mahout, a collection of machine learning algorithms for Hadoop, didn't yet have a C4.5-like implementation for decision trees. C4.5 trees are useful yet conceptually simple, so that seemed like a good place to start.

The following few posts will cover my approach toward building a C4.5-like classification tree for Hadoop and follow up with my plans for a regression tree.  This is an educational exercise for me, so if you notice I'm doing something stupid or just have some MapReduce tips or tricks, please let me know!


Why do we care?

First things first: why do we care? Why bother with all this Hadoop stuff? Why not just sample a large dataset down to something small enough for a single machine?

For the most part, sampling really is good enough. But if a problem space is extremely complex, then a smaller sample may not give an adequately clear view.

As an example, I've chosen an area of the Mandelbrot set (in Elephant Valley) as my problem space. I pick random points and label each one true or false depending on whether it lies in the Mandelbrot set.
Mandelbrot CSV Sample:
0.322, 0.093, false
0.301, 0.085, false
0.377, 0.086, true
0.321, 0.089, false
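
Roughly, the generator looks something like the sketch below. The escape-time test is the standard one; the bounding box around Elephant Valley, the iteration limit, and the file name are illustrative stand-ins rather than the exact values I used.

import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

public class MandelbrotCsv {

    // Standard escape-time test: iterate z = z^2 + c and call the point a
    // member of the set if it hasn't escaped after maxIter iterations.
    static boolean inMandelbrot(double cx, double cy, int maxIter) {
        double x = 0.0, y = 0.0;
        for (int i = 0; i < maxIter; i++) {
            double xNew = x * x - y * y + cx;
            double yNew = 2.0 * x * y + cy;
            x = xNew;
            y = yNew;
            if (x * x + y * y > 4.0) {
                return false;  // escaped, so not in the set
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        int n = Integer.parseInt(args[0]);  // e.g. 10000000 for the 10M sample
        Random rnd = new Random();

        // Rough bounding box around Elephant Valley (illustrative only).
        double xMin = 0.30, xMax = 0.38, yMin = 0.08, yMax = 0.10;

        try (PrintWriter out = new PrintWriter("mandelbrot-" + n + ".csv")) {
            for (int i = 0; i < n; i++) {
                double x = xMin + rnd.nextDouble() * (xMax - xMin);
                double y = yMin + rnd.nextDouble() * (yMax - yMin);
                out.printf("%.3f, %.3f, %b%n", x, y, inMandelbrot(x, y, 500));
            }
        }
    }
}
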
I generated four datasets from this space with sizes of 10 million, 1 million, 100 thousand, and 10 thousand points. I plotted the datasets below, coloring a point red if it's within the Mandelbrot set and green if it is not. As you can see in the plots, 10 million points gives a fairly high-fidelity picture of the space. Not surprisingly, the 1 million, 100K, and 10K samples give progressively worse detail. A classifier trained on one of the smaller samples will do well for the majority of the space, but the lack of resolution near the boundary will make that area troublesome.

Using either Weka's REPTree (similar to C4.5) or RandomForest, I can learn decision trees for the 10K, 100K, and 1M datasets. The 10M dataset, however, is too large to train either classifier inside my humble 2GB JVM.
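
For reference, the Weka side is only a few lines. A minimal sketch (the file name is a placeholder, and Weka's CSVLoader expects a header row of attribute names, so assume the files carry one):

import java.io.File;

import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class TrainRepTree {
    public static void main(String[] args) throws Exception {
        // Load one of the samples; CSVLoader treats the first row as attribute names.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("mandelbrot-1000000.csv"));
        Instances data = loader.getDataSet();

        // The true/false label is the last column.
        data.setClassIndex(data.numAttributes() - 1);

        REPTree tree = new REPTree();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}

Swapping REPTree for weka.classifiers.trees.RandomForest is a one-line change.
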

I also generated a second set of data that mirrors the first, but adds 50 dummy features. Each dummy feature is simply a random number between 0 and 1. For this dataset the Weka tree classifiers can only cope with the 10K and 100K samples.
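
Building those rows is a small tweak to the generator sketch above. Something like this helper works, assuming the two real features come first and the label stays last:

import java.util.Random;

public class DummyFeatures {
    // Build one CSV row: the two real features, 50 uniform-random dummy
    // features in [0, 1), and the true/false label last.
    static String rowWithDummies(double x, double y, boolean label, Random rnd) {
        StringBuilder sb = new StringBuilder(String.format("%.3f, %.3f", x, y));
        for (int d = 0; d < 50; d++) {
            sb.append(String.format(", %.3f", rnd.nextDouble()));
        }
        return sb.append(", ").append(label).toString();
    }
}
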

A Hadoop classifier can tackle the larger datasets and, hopefully, produce better results for points in the tricky areas of our problem space. Sampling and using a meta-classifier (like bagging) might do just as well. But, like I said earlier, this is an educational exercise. :)

[Plots of the 10M, 1M, 100K, and 10K samples]

So that's the brief introduction to the data I'll be using in the following posts. Head to the next section to read about the general design for my Hadoop classification tree.

Continue to Part 2: Approach.