Friday, June 10, 2011

Streaming (one-pass) Sampling With Replacement

Update

I now have a whole fancy library for random sampling in Clojure, including stream sampling and reservoir sampling with lots of extras (with and without replacement, weighting, seeds). It's available here:
https://github.com/bigmlcom/sampling


Recently I've been looking at sampling techniques for very large or streaming datasets. Specifically, I needed an algorithm that could perform random sampling with replacement during a single pass over a dataset too large to fit into memory.

A bit of searching led me to a method that accomplishes just that using a dynamic sample reservoir. However, I wanted the ability to generate very large samples. Just like the original dataset, the sample might not fit into memory. So reservoirs were out.

Fortunately, my problem had a simplification that made things much easier. The reservoir approach assumes that the size of the original dataset is unknown. Not true in my case: before I start sampling I'll know whether the dataset has a million instances or a billion.

This piece of information lets me cheat. Given the overall population size (the number of instances in the original dataset) and the intended sample size, I can calculate the occurrence probabilities for a single instance. I can find how likely an instance is to be sampled once, twice, three times, etc.


Once I have that probability distribution I can iterate over the original dataset. For each instance I just roll the die to determine how many times to sample it.

To calculate the occurrence probabilities, I use the following equation. Let x be the number of occurrences, l be the sample size, and n the original population size:

P(x) = C(l, x) · (1/n)^x · (1 − 1/n)^(l − x)
In essence, this equation takes the probability of an instance being selected x times, not selected (l-x) times, and multiplies it all by the total number of ways this could happen.

To build a full distribution I calculate the probability of each occurrence count, starting at 0 and going up until I've captured nearly all the probability mass. This seems to work quite nicely, even for wacky situations like over-sampling the original population.
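To make the approach concrete, here's a minimal sketch in Python (rather than Clojure; the function names are mine and this is not the library code linked above):

```python
import math
import random

def occurrence_probability(x, l, n):
    """P(an instance appears x times in a size-l sample drawn
    with replacement from a population of size n)."""
    return math.comb(l, x) * (1.0 / n) ** x * (1.0 - 1.0 / n) ** (l - x)

def occurrence_cdf(l, n, mass=0.999999):
    """Cumulative probabilities for 0, 1, 2, ... occurrences,
    extended until nearly all the probability mass is captured."""
    cdf, total, x = [], 0.0, 0
    while total < mass:
        total += occurrence_probability(x, l, n)
        cdf.append(total)
        x += 1
    return cdf

def sample_with_replacement(stream, l, n):
    """Single pass over the data: for each instance, roll the die
    against the CDF to decide how many times to emit it."""
    cdf = occurrence_cdf(l, n)
    for instance in stream:
        r = random.random()
        occurrences = next((i for i, p in enumerate(cdf) if r < p), len(cdf))
        for _ in range(occurrences):
            yield instance
```

Since the CDF is computed once up front, per-instance work is just one random draw and a short scan, and nothing but the CDF is held in memory.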


These are the strong points of this method:
  • Single pass sampling with replacement for any size dataset without any significant memory requirements
  • The original dataset can be split into chunks, sampled separately, and then recombined (perfect for map-reduce)
  • More than one sample set could be built in the same pass over the original data

But with these caveats:
  • Must know the size of the original dataset
  • The final size of the sample set will be near the intended size - but will be a little bigger or smaller depending on lady luck
  • The order of the sampled data won't be randomized

And finally, my Clojure implementation of this technique:

Friday, May 13, 2011

Another Clojure snippet

In my continuing adventure learning Clojure, I've put together a bit of code that takes a collection of strings and filters any that are contained in another string.

To whoever is interested:
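For a rough sense of the behavior, here's the same idea sketched in Python (a stand-in, not the original Clojure snippet):

```python
def remove_substrings(strings):
    """Keep only the strings that are not contained in some other
    string from the collection (duplicates are kept as-is)."""
    return [s for s in strings
            if not any(s != other and s in other for other in strings)]
```

For example, given `["cat", "catalog", "log"]`, both "cat" and "log" are substrings of "catalog" and get filtered out.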

Saturday, May 7, 2011

Risk Roll'n (with Clojure)

Once upon a time my frequent and unlucky defeats while playing Risk led me to build a super simple Monte Carlo-style simulator for Risk battles. It didn't help me win any more often, but at least I could complain about my losses by showing just how unlikely they were. It was a consolation prize of sorts.

Recently I've been learning Clojure with the intent of trying out Cascalog for simplifying Hadoop workflows. It's been ages since I've programmed in a functional language, so I wanted a bite sized project to help get comfy with Clojure. The Risk battle simulator seemed like a good fit - so here it is rebuilt with Clojure-y goodness.


Like I said, I'm new to Clojure, so don't assume any of that is idiomatic. Nonetheless, the code is pretty concise compared to my Java implementation and the slick Incanter library made it easy to chart the distribution of outcomes for a battle.

Running "(view-outcomes {:attackers 10 :defenders 10} 1000000)" will compute and chart the outcomes of a million 10 vs. 10 Risk battles. The magnitude on the outcome axis indicates how many armies are left; the negative range means the defender wins and the positive range means the attacker wins.
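For anyone curious how such a simulator works, here's a rough Python sketch using the standard Risk dice rules (attacker rolls up to three dice, defender up to two, high dice are paired, ties go to the defender); this is my own stand-in, not the original Clojure:

```python
import random

def roll(n):
    """Roll n six-sided dice, highest first."""
    return sorted((random.randint(1, 6) for _ in range(n)), reverse=True)

def battle(attackers, defenders):
    """Fight until one side is wiped out. Returns armies left:
    positive means the attacker won, negative means the defender won."""
    while attackers > 0 and defenders > 0:
        a_dice = roll(min(3, attackers))
        d_dice = roll(min(2, defenders))
        for a, d in zip(a_dice, d_dice):
            if a > d:
                defenders -= 1
            else:
                attackers -= 1  # ties go to the defender
    return attackers if attackers > 0 else -defenders

def outcomes(attackers, defenders, trials):
    """Monte Carlo distribution of battle outcomes as a histogram."""
    counts = {}
    for _ in range(trials):
        o = battle(attackers, defenders)
        counts[o] = counts.get(o, 0) + 1
    return counts
```

The histogram from `outcomes` is what would feed a chart like the one described above.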


So the next time your 20 armies are destroyed by 12 measly defenders, you can point at this chart and complain with accuracy:

Saturday, March 5, 2011

Decision Trees & Hadoop - Part 3: Results

Part 1: Data | Part 2: Approach | Part 3: Results

And now the (exciting?) conclusion of my adventures building a C4.5-like decision tree with Hadoop. Since my previous post I've implemented the design, added a few new wrinkles, and collected the results using my Mandelbrot toy problem.

I'm skipping the low level details, but the "new wrinkles" are two additional map-reduce tasks.

One of the tasks triggers whenever the number of instances in a tree node drops below a predefined threshold. The threshold is set low enough that we know the instances can fit into the process's memory. So we go ahead and grow the rest of that tree branch in the standard recursive manner, greatly reducing the time it takes to build the entire tree.

The second task simply searches for instances that belong to leaves of the tree (nodes we no longer consider for splitting) and removes them. This reduces the amount of data we need to evaluate in future iterations.
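As a sketch of that second task, here's the routing-and-filtering logic in Python (the node format is invented for illustration; the real implementation is a Hadoop map task):

```python
def route(tree, instance):
    """Descend through split nodes until reaching a frontier node.
    Toy node format: {'feature': f, 'threshold': t, 'left': .., 'right': ..}
    for splits, and {'open': bool} for terminal nodes ('open' means the
    node is still a candidate for splitting; closed nodes are leaves)."""
    node = tree
    while 'feature' in node:
        branch = 'left' if instance[node['feature']] <= node['threshold'] else 'right'
        node = node[branch]
    return node

def prune_mapper(tree, instances):
    """Drop instances that land in closed leaves, keeping only those
    still needed to decide future splits."""
    return [i for i in instances if route(tree, i)['open']]
```

Each iteration the surviving instances shrink as more of the tree's frontier closes, which is where the speedup comes from.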

I ran the Hadoop tree algorithm on the toy dataset I described in part 1 and compared it to the RepTree and RandomForest techniques from Weka's collection. RepTree and RandomForest outperformed HadoopTree on the 10K, 100K, and 1M datasets, but both failed to build on the 10M dataset (using a 2 GB JVM). The HadoopTree trained on the 10M dataset had the best overall accuracy.

To give the HadoopTree fair competition I added 11 bagged RepTrees (bagging being a more traditional way to tackle giant datasets). The bagged RepTree's performance was nearly an exact match of the HadoopTree's results.



In the second round of tests I used Mandelbrot data with 50 dummy numeric features. This made the dataset much larger and RepTree and RandomForest failed to build on both the 1M and 10M datasets. Random splits don't do well with a large number of junk features so, not surprisingly, the RandomForest's performance suffered on this dataset.

The HadoopTree trained on 10M instances carries the day, but a bag of 101 RepTrees comes in a close second. Considering that bagged smaller trees are much faster to grow and would do better on noisy data, the single HadoopTree is more a novelty than a practical tool. Although I'm not implementing it at the moment, I expect this technique would provide a bigger payoff if it were modified to grow boosted regression trees.



See the full results here.

The code is viewable on GitHub, but be warned, it's still a toy project. I haven't serialized the output from the map-reduce tasks (so there's a lot more data transfer than there should be) or made a proper parameter/config file.