Friday, June 10, 2011

Streaming (one-pass) Sampling With Replacement

Update

I now have a whole fancy library for random sampling in Clojure, including stream sampling and reservoir sampling with lots of extras (with and without replacement, weighting, seeds). It's available here:
https://github.com/bigmlcom/sampling


Recently I've been looking at sampling techniques for very large or streaming datasets. Specifically, I needed an algorithm that could perform random sampling with replacement during a single pass over a dataset too large to fit into memory.

A bit of searching led me a method that accomplishes just that using a dynamic sample reservoir. However, I wanted the ability to generate very large samples. Just like the original dataset, the sample might not fit into memory. So reservoirs were out.

Fortunately my problem had a simplification that made things much easier.  The reservoir approach assumed that the size of the original dataset was unknown. Not true in my case. Before I start the sample I'll know whether the dataset has a million instances or a billion.

This piece of information lets me cheat. Given the overall population size (the number of instances in the original dataset) and the intended sample size, I can calculate the occurrence probabilities for a single instance. I can find how likely an instance is to be sampled once, twice, three times, etc.


Once I have that probability distribution I can iterate over the original dataset. For each instance I just roll the die to determine how many times to sample it.

To calculate the occurrence probabilities, I use the following equation.  Let x be the occurrences, l be the sample size, and n the original population size:


In essence, this equation is taking the probability of an instance being selected x times, not selected (l-x) times, and multiplying it all by total number of ways this could happen.

To build a full distribution I calculate the probability of each occurrence starting at 0 and going up until I've captured nearly all the probability mass. This seems to works quite nicely, even for wacky situations like over-sampling the original population.


These are the strong points of this method:
  • Single pass sampling with replacement for any size dataset without any significant memory requirements
  • The original dataset can be split into chunks, sampled separately, and then recombined (perfect for map-reduce)
  • More than one sample set could be built in the same pass over the original data

But with these caveats:
  • Must know the size of the original dataset
  • The final size of the sample set will be near the intended size - but will be a little bigger or smaller depending on lady luck
  • The order of the sampled data won't be randomized

And finally, my Clojure implementation of this technique:

;; For a fully fleshed out library, see:
;; https://github.com/bigmlcom/sampling
;; -------------------------------------
;; see http://ashenfad.blogspot.com/2011/06/single-pass-sampling-with-replacement.html
(ns sample.replacement
(:use [clojure.contrib.math :only [expt]]))
(defn- choose [a b]
(let [numerator (apply * (range (- (inc a) b) (inc a)))
denominator (apply * (range 1 (inc b)))]
(/ numerator denominator)))
(defn- find-occurance-prob [population-size sample-size occurances]
(let [single-select-prob (double (/ 1 population-size))
single-unselect-prob (- 1 single-select-prob)
occurance-prob (expt single-select-prob occurances)
non-occurances (- sample-size occurances)
non-occurance-prob (expt single-unselect-prob non-occurances)
combinations (choose sample-size occurances)
probability (* occurance-prob non-occurance-prob combinations)]
probability))
(defn- normalize-dist [dist-seq]
(let [sum (apply + (map second dist-seq))]
(map (fn [x] [(first x) (/ (second x) sum)]) dist-seq)))
(defn- occurance-dist [population-size sample-size]
(loop [cd-value 0
distribution []]
(if (< cd-value 0.999999)
(let [occurances (count distribution)
occurance-prob (find-occurance-prob
population-size
sample-size
occurances)]
(recur
(+ cd-value occurance-prob)
(conj distribution [occurances occurance-prob])))
(normalize-dist distribution))))
(defn occurance-map [population-size sample-size]
(let [init-dist-seq (occurance-dist population-size sample-size)]
(loop [sum 0
dist-seq init-dist-seq
dist-map (sorted-map)]
(let [current-occurance (first dist-seq)
rest-dist-seq (rest dist-seq)
occurance-count (first current-occurance)
occurance-sum (+ sum (second current-occurance))
new-dist-map (assoc dist-map occurance-sum occurance-count)]
(if (seq rest-dist-seq)
(recur occurance-sum rest-dist-seq new-dist-map)
new-dist-map)))))
(defn- roll-occurances [dist-map]
(second (first (subseq dist-map >= (rand)))))
(defn- sample-occurances [dist-map datum]
(take (roll-occurances dist-map) (repeat datum)))
(defn sample-with-replacement-dist [dist-map data]
(mapcat
(fn [datum] (sample-occurances dist-map datum))
data))
(defn sample-with-replacement [population-size sample-size data]
(sample-with-replacement-dist
(occurance-map population-size sample-size)
(take population-size data)))
; example:
;(def population-size 100000)
;(def sample-size 10000)
;(def data (take population-size (repeatedly (fn [] (rand-int 10000)))))
;(def sample (sample-with-replacement population-size sample-size data))