Saturday, March 5, 2011

Decision Trees & Hadoop - Part 3: Results

Part 1: Data | Part 2: Approach | Part 3: Results

And now the (exciting?) conclusion of my adventures building a C4.5-like decision tree with Hadoop. Since my previous post, I've implemented the design, added a few new wrinkles, and collected results on my Mandelbrot toy problem.

I'm skipping the low-level details, but the "new wrinkles" are two additional map-reduce tasks.

One of the tasks triggers whenever the number of instances in a tree node drops below a predefined threshold. The threshold is set low enough that we know the instances will fit in a single process's memory, so we go ahead and grow the rest of that branch in the standard recursive manner, which greatly reduces the time it takes to build the entire tree.
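Here's a minimal sketch of that hand-off. The class, field, and method names are invented for illustration, and the threshold value is an assumption; the real code on GitHub is organized differently.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of the "finish this branch in memory" wrinkle. Once a node's
 * instance count drops below a threshold, we stop scheduling map-reduce passes
 * for it and grow the rest of that branch with plain recursive splitting.
 */
public class InMemoryGrower {

    static final int IN_MEMORY_THRESHOLD = 10000; // assumed value, not from the post

    static class Instance {
        final double[] features;
        final int label;
        Instance(double[] features, int label) { this.features = features; this.label = label; }
    }

    static class Node {
        int splitFeature = -1;  // -1 marks a leaf
        double splitValue;
        Node left, right;
        int majorityLabel;
    }

    /** Called during node expansion: hand off to in-memory growth once the node is small enough. */
    static Node maybeFinishInMemory(List<Instance> data, int numClasses) {
        if (data.size() >= IN_MEMORY_THRESHOLD) return null; // keep using map-reduce passes
        return grow(data, numClasses);
    }

    /** Standard greedy, recursive growth using an information-gain criterion. */
    static Node grow(List<Instance> data, int numClasses) {
        Node node = new Node();
        node.majorityLabel = majority(data, numClasses);
        double parentEntropy = entropy(data, numClasses);
        if (data.size() < 2 || parentEntropy == 0.0) return node; // pure or tiny leaf

        double bestGain = 0.0;
        for (int f = 0; f < data.get(0).features.length; f++) {
            for (Instance candidate : data) {
                double t = candidate.features[f];
                List<Instance> l = new ArrayList<Instance>(), r = new ArrayList<Instance>();
                for (Instance i : data) (i.features[f] < t ? l : r).add(i);
                if (l.isEmpty() || r.isEmpty()) continue;
                double gain = parentEntropy
                        - (l.size() * entropy(l, numClasses)
                         + r.size() * entropy(r, numClasses)) / data.size();
                if (gain > bestGain) {
                    bestGain = gain;
                    node.splitFeature = f;
                    node.splitValue = t;
                }
            }
        }
        if (node.splitFeature < 0) return node; // no useful split found: leave as a leaf

        List<Instance> l = new ArrayList<Instance>(), r = new ArrayList<Instance>();
        for (Instance i : data) (i.features[node.splitFeature] < node.splitValue ? l : r).add(i);
        node.left = grow(l, numClasses);
        node.right = grow(r, numClasses);
        return node;
    }

    static double entropy(List<Instance> data, int numClasses) {
        int[] counts = new int[numClasses];
        for (Instance i : data) counts[i.label]++;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / data.size();
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    static int majority(List<Instance> data, int numClasses) {
        int[] counts = new int[numClasses];
        for (Instance i : data) counts[i.label]++;
        int best = 0;
        for (int c = 1; c < numClasses; c++) if (counts[c] > counts[best]) best = c;
        return best;
    }
}
```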

The second task simply searches for instances that belong to leaves of the tree (nodes we no longer consider for splitting) and removes them. This reduces the amount of data we need to evaluate in future iterations.
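As a rough sketch, this pass can be a map-only job that keeps the set of finished node ids and drops every record routed to one of them. The record layout and configuration key below are assumptions for illustration, not the project's actual format.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Sketch of the "drop finished instances" pass as a map-only job. Each input
 * line is assumed to look like "nodeId<TAB>featureCsv,label"; the ids of nodes
 * already turned into leaves are assumed to arrive via the job configuration.
 */
public class LeafFilterMapper extends Mapper<Object, Text, Text, NullWritable> {

    private final Set<String> leafIds = new HashSet<String>();

    @Override
    protected void setup(Context context) {
        // Hypothetical config key holding a comma-separated list of leaf node ids.
        String leaves = context.getConfiguration().get("tree.leaf.ids", "");
        leafIds.addAll(Arrays.asList(leaves.split(",")));
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        int tab = line.indexOf('\t');
        String nodeId = tab < 0 ? line : line.substring(0, tab);
        // Only keep instances whose node is still a candidate for splitting;
        // anything that has reached a leaf is dropped from the next iteration.
        if (!leafIds.contains(nodeId)) {
            context.write(value, NullWritable.get());
        }
    }
}
```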

I ran the Hadoop tree algorithm on the toy dataset described in part 1 and compared it to the RepTree and RandomForest learners from Weka's collection. RepTree and RandomForest outperformed the HadoopTree on the 10K, 100K, and 1M datasets, but both failed to build on the 10M dataset (using a 2 GB JVM). The HadoopTree trained on the 10M dataset had the best overall accuracy.
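For reference, the Weka baselines were run roughly along these lines; the file names are placeholders and this isn't the exact harness I used.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Rough sketch of a Weka baseline run: train a RepTree and score it on held-out data. */
public class WekaBaseline {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("mandelbrot-train.arff"); // placeholder file name
        Instances test  = DataSource.read("mandelbrot-test.arff");  // placeholder file name
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        REPTree tree = new REPTree();   // swap in RandomForest for the other baseline
        tree.buildClassifier(train);    // this is the step that runs out of memory on 10M rows

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println("Accuracy: " + eval.pctCorrect() + "%");
    }
}
```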

To give the HadoopTree fair competition, I added an ensemble of 11 bagged RepTrees (bagging being a more traditional way to tackle giant datasets). The bagged RepTrees' performance was nearly an exact match for the HadoopTree's results.
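Weka's Bagging meta-classifier makes this baseline a few lines. The bag size below is just a placeholder: the point of bagging here is that each tree only needs to see a subsample small enough to fit in memory.

```java
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Sketch of the bagged-RepTree baseline using Weka's Bagging meta-classifier. */
public class BaggedRepTrees {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("mandelbrot-train.arff"); // placeholder file name
        train.setClassIndex(train.numAttributes() - 1);

        Bagging bag = new Bagging();
        bag.setClassifier(new REPTree());
        bag.setNumIterations(11);   // 11 trees, as in the first comparison
        bag.setBagSizePercent(10);  // assumed: small bags so each tree fits in memory
        bag.buildClassifier(train);
    }
}
```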



In the second round of tests I used Mandelbrot data padded with 50 dummy numeric features. This made the dataset much larger, and both RepTree and RandomForest now failed to build on the 1M as well as the 10M datasets. Randomly selecting features at each split doesn't cope well with a large number of junk features, so, not surprisingly, RandomForest's performance suffered on this dataset.

The HadoopTree trained on 10M instances carries the day, but a bag of 101 RepTrees comes in a close second. Considering that the smaller bagged trees are much faster to grow and would handle noisy data better, the single HadoopTree is more a novelty than a practical tool. Although I'm not implementing it at the moment, I expect this technique would pay off more if it were modified to grow boosted regression trees.



See the full results here.

The code is viewable on GitHub, but be warned: it's still a toy project. I haven't serialized the output from the map-reduce tasks (so there's far more data transfer than there should be), nor have I created a proper parameter/config file.