Update
I now have a whole fancy library for random sampling in Clojure, including stream sampling and reservoir sampling with lots of extras (with and without replacement, weighting, seeds). It's available here: https://github.com/bigmlcom/sampling
Recently I've been looking at sampling techniques for very large or streaming datasets. Specifically, I needed an algorithm that could perform random sampling with replacement during a single pass over a dataset too large to fit into memory.
A bit of searching led me to a method that accomplishes just that using a dynamic sample reservoir. However, I wanted the ability to generate very large samples. Just like the original dataset, the sample might not fit into memory. So reservoirs were out.
Fortunately my problem had a simplification that made things much easier. The reservoir approach assumed that the size of the original dataset was unknown. Not true in my case. Before I start the sample I'll know whether the dataset has a million instances or a billion.
This piece of information lets me cheat. Given the overall population size (the number of instances in the original dataset) and the intended sample size, I can calculate the occurrence probabilities for a single instance. I can find how likely an instance is to be sampled once, twice, three times, etc.
Once I have that probability distribution I can iterate over the original dataset. For each instance I just roll the die to determine how many times to sample it.
To calculate the occurrence probabilities, I use the following equation. Let x be the number of occurrences, l be the sample size, and n be the original population size:

P(x) = (l choose x) · (1/n)^x · (1 - 1/n)^(l - x)
In essence, this equation takes the probability of an instance being selected x times, not selected the remaining (l - x) times, and multiplies it all by the total number of ways this could happen.
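As a rough sketch of that calculation (this is not the library code; the function names here are just illustrative), the probability of a single instance turning up exactly x times could be computed like so:

```clojure
;; Sketch only: the occurrence probability described above. For huge
;; sample sizes you'd probably want to work in log space to avoid
;; underflow, but this shows the idea.
(defn- choose
  "Binomial coefficient: the number of ways to pick x of the l draws."
  [l x]
  (reduce (fn [acc i]
            (/ (* acc (inc (- l i))) i))
          1
          (range 1 (inc x))))

(defn occurrence-prob
  "Probability that one instance is drawn exactly x times when taking
   a sample of size l, with replacement, from a population of size n."
  [n l x]
  (* (choose l x)
     (Math/pow (/ 1.0 n) x)
     (Math/pow (- 1.0 (/ 1.0 n)) (- l x))))
```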
To build a full distribution I calculate the probability of each occurrence count, starting at 0 and going up until I've captured nearly all the probability mass. This seems to work quite nicely, even for wacky situations like over-sampling the original population.
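Continuing the sketch (again, illustrative names rather than the library's, and the 0.999999 cutoff for "nearly all the probability mass" is an arbitrary choice), building the distribution and rolling the die for each instance might look something like this:

```clojure
;; Sketch only, building on occurrence-prob above.
(defn occurrence-cdf
  "Cumulative occurrence probabilities for x = 0, 1, 2, ... stopping
   once nearly all of the probability mass has been captured."
  [n l]
  (->> (range)
       (map #(occurrence-prob n l %))
       (reductions +)
       (take-while #(< % 0.999999))
       (vec)))

(defn roll-occurrences
  "Roll the die: how many times should this instance be sampled?
   The tiny leftover tail mass is lumped into the final count."
  [cdf]
  (let [r (rand)]
    (count (take-while #(< % r) cdf))))

(defn sample-with-replacement
  "Single pass over data (whose total size n is known up front),
   emitting each instance as many times as its roll says. The result
   is lazy, so neither the data nor the sample has to fit in memory."
  [data n sample-size]
  (let [cdf (occurrence-cdf n sample-size)]
    (mapcat (fn [instance]
              (repeat (roll-occurrences cdf) instance))
            data)))
```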
These are the strong points of this method:
- Single pass sampling with replacement for any size dataset without any significant memory requirements
- The original dataset can be split into chunks, sampled separately, and then recombined (perfect for map-reduce; see the usage sketch below)
- More than one sample set could be built in the same pass over the original data
But with these caveats:
- Must know the size of the original dataset
- The final size of the sample set will be near the intended size - but will be a little bigger or smaller depending on lady luck
- The order of the sampled data won't be randomized
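To illustrate both the chunking and the "roughly the intended size" points, here's a hypothetical use of the sketch above. Each chunk is sampled independently (as a mapper might do it) and the results are concatenated; since every instance rolls against the same occurrence distribution, this is equivalent to sampling the whole dataset in one pass:

```clojure
;; Hypothetical usage of the sketch above.
(let [data (range 1000000)                ; stand-in for a large dataset
      n 1000000                           ; population size, known up front
      sample-size 50000
      chunks (partition-all 250000 data)] ; e.g. one chunk per mapper
  (count (mapcat #(sample-with-replacement % n sample-size) chunks)))
;; => something close to 50000, a little bigger or smaller each run
```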
And finally, my Clojure implementation of this technique: