Thursday, February 21, 2013

Streaming histograms... faster

I just pushed v3.1.0 of the BigML streaming histogram library. For those stumbling across this blog, the library provides an embellished implementation of the histograms introduced by Ben Haim in JMLR. It's a handy way to compress a stream of data using limited memory. The algorithm is online/streaming, so it only needs on pass over the data and it can always give you an estimate of the distribution so far.

I recently noticed that Hive includes a simple implementation. That makes sense. The histograms are merge friendly, so they're a good tool for distributed systems. But Hive's version was much faster than mine for small histograms. That bugged me. :P

Hive uses a simple array to back the histogram. That's bad for histograms with lots of bins, since inserting a point costs O(n) with respect to # of bins in the histogram. My histograms had a better time order, O(logn), but also more overhead. Version 3.1.0 includes both approaches and switches between them depending on your histogram.

Anyway, it's a Java library but with lots of bells and whistles for Clojure devs. If you deal with streaming data on the JVM, give it a look!