Update
I now have a whole fancy library for random sampling in Clojure, including stream sampling and reservoir sampling with lots of extras (with and without replacement, weighting, seeds). It's available here:https://github.com/bigmlcom/sampling
Recently I've been looking at sampling techniques for very large or streaming datasets. Specifically, I needed an algorithm that could perform random sampling with replacement during a single pass over a dataset too large to fit into memory.
A bit of searching led me a method that accomplishes just that using a dynamic sample reservoir. However, I wanted the ability to generate very large samples. Just like the original dataset, the sample might not fit into memory. So reservoirs were out.
Fortunately my problem had a simplification that made things much easier. The reservoir approach assumed that the size of the original dataset was unknown. Not true in my case. Before I start the sample I'll know whether the dataset has a million instances or a billion.
This piece of information lets me cheat. Given the overall population size (the number of instances in the original dataset) and the intended sample size, I can calculate the occurrence probabilities for a single instance. I can find how likely an instance is to be sampled once, twice, three times, etc.
Once I have that probability distribution I can iterate over the original dataset. For each instance I just roll the die to determine how many times to sample it.
To calculate the occurrence probabilities, I use the following equation. Let x be the occurrences, l be the sample size, and n the original population size:
In essence, this equation is taking the probability of an instance being selected x times, not selected (l-x) times, and multiplying it all by total number of ways this could happen.
To build a full distribution I calculate the probability of each occurrence starting at 0 and going up until I've captured nearly all the probability mass. This seems to works quite nicely, even for wacky situations like over-sampling the original population.
These are the strong points of this method:
- Single pass sampling with replacement for any size dataset without any significant memory requirements
- The original dataset can be split into chunks, sampled separately, and then recombined (perfect for map-reduce)
- More than one sample set could be built in the same pass over the original data
But with these caveats:
- Must know the size of the original dataset
- The final size of the sample set will be near the intended size - but will be a little bigger or smaller depending on lady luck
- The order of the sampled data won't be randomized
And finally, my Clojure implementation of this technique:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
;; For a fully fleshed out library, see: | |
;; https://github.com/bigmlcom/sampling | |
;; ------------------------------------- | |
;; see http://ashenfad.blogspot.com/2011/06/single-pass-sampling-with-replacement.html | |
(ns sample.replacement | |
(:use [clojure.contrib.math :only [expt]])) | |
(defn- choose [a b] | |
(let [numerator (apply * (range (- (inc a) b) (inc a))) | |
denominator (apply * (range 1 (inc b)))] | |
(/ numerator denominator))) | |
(defn- find-occurance-prob [population-size sample-size occurances] | |
(let [single-select-prob (double (/ 1 population-size)) | |
single-unselect-prob (- 1 single-select-prob) | |
occurance-prob (expt single-select-prob occurances) | |
non-occurances (- sample-size occurances) | |
non-occurance-prob (expt single-unselect-prob non-occurances) | |
combinations (choose sample-size occurances) | |
probability (* occurance-prob non-occurance-prob combinations)] | |
probability)) | |
(defn- normalize-dist [dist-seq] | |
(let [sum (apply + (map second dist-seq))] | |
(map (fn [x] [(first x) (/ (second x) sum)]) dist-seq))) | |
(defn- occurance-dist [population-size sample-size] | |
(loop [cd-value 0 | |
distribution []] | |
(if (< cd-value 0.999999) | |
(let [occurances (count distribution) | |
occurance-prob (find-occurance-prob | |
population-size | |
sample-size | |
occurances)] | |
(recur | |
(+ cd-value occurance-prob) | |
(conj distribution [occurances occurance-prob]))) | |
(normalize-dist distribution)))) | |
(defn occurance-map [population-size sample-size] | |
(let [init-dist-seq (occurance-dist population-size sample-size)] | |
(loop [sum 0 | |
dist-seq init-dist-seq | |
dist-map (sorted-map)] | |
(let [current-occurance (first dist-seq) | |
rest-dist-seq (rest dist-seq) | |
occurance-count (first current-occurance) | |
occurance-sum (+ sum (second current-occurance)) | |
new-dist-map (assoc dist-map occurance-sum occurance-count)] | |
(if (seq rest-dist-seq) | |
(recur occurance-sum rest-dist-seq new-dist-map) | |
new-dist-map))))) | |
(defn- roll-occurances [dist-map] | |
(second (first (subseq dist-map >= (rand))))) | |
(defn- sample-occurances [dist-map datum] | |
(take (roll-occurances dist-map) (repeat datum))) | |
(defn sample-with-replacement-dist [dist-map data] | |
(mapcat | |
(fn [datum] (sample-occurances dist-map datum)) | |
data)) | |
(defn sample-with-replacement [population-size sample-size data] | |
(sample-with-replacement-dist | |
(occurance-map population-size sample-size) | |
(take population-size data))) | |
; example: | |
;(def population-size 100000) | |
;(def sample-size 10000) | |
;(def data (take population-size (repeatedly (fn [] (rand-int 10000))))) | |
;(def sample (sample-with-replacement population-size sample-size data)) |