Wednesday, September 3, 2014

Sketching/Hashing Algorithms in Clojure

Just a short note that I (and BigML) have open-sourced a library of hashing- and sketching-based stream summarizers for Clojure.

Specifically, the library includes techniques that take a stream of items and return a compact summary that can be queried for set membership (Bloom filters), set similarity (min-hashes), item occurrence counts (count-min sketches), or the number of distinct items (my favorite, the magical HyperLogLog).
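To make "stream in, queryable summary out" concrete, here is a minimal Bloom filter sketch in Python. It is not this library's API or implementation: the function names (`bloom_create`, `bloom_insert`, `bloom_contains`), the default sizes, and the use of salted SHA-256 as the hash family are all assumptions chosen for illustration.

```python
import hashlib


def _positions(item, num_bits, num_hashes):
    """Yield num_hashes bit positions for item, one per salted SHA-256 digest."""
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % num_bits


def bloom_create(num_bits=1024, num_hashes=3):
    """Return an empty filter stored as a plain dict (hypothetical layout)."""
    return {"num_bits": num_bits, "num_hashes": num_hashes,
            "bits": [False] * num_bits}


def bloom_insert(bloom, item):
    """Set every bit position that item hashes to."""
    for pos in _positions(item, bloom["num_bits"], bloom["num_hashes"]):
        bloom["bits"][pos] = True
    return bloom


def bloom_contains(bloom, item):
    """True if item may have been inserted; False means it definitely was not."""
    return all(bloom["bits"][pos]
               for pos in _positions(item, bloom["num_bits"], bloom["num_hashes"]))


# Summarize a small stream, then query the summary instead of the stream.
bloom = bloom_create()
for word in ["apple", "banana", "cherry"]:
    bloom_insert(bloom, word)
```

A Bloom filter never produces a false negative, so `bloom_contains(bloom, "apple")` is guaranteed to be true here. A query for an item that was never inserted will usually be false, but it can occasionally be true; that false-positive rate is the price paid for the fixed memory footprint.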

This library was largely an educational exercise for me, as I wanted to better understand the world of streaming summaries for categorical data. It's written in almost pure Clojure and backed by plain Clojure data structures, so it's (hopefully!) easy to use and easy to serialize. All the summaries are merge-friendly, which makes them a nice fit for distributed settings. The big caveat is that I didn't spend much effort optimizing for speed. Those who need to squeeze the most out of every CPU cycle may need to look elsewhere.
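The merge-friendly and easy-to-serialize properties can be illustrated with a small count-min sketch in Python. Again, this is only a sketch under stated assumptions, not the library's code: the names (`cms_create`, `cms_insert`, `cms_estimate`, `cms_merge`), the dict-of-lists layout, and the salted SHA-256 hashing are hypothetical.

```python
import hashlib
import json


def cms_create(width=64, depth=4):
    """Return an empty count-min sketch as a plain dict of nested lists."""
    return {"width": width, "depth": depth,
            "rows": [[0] * width for _ in range(depth)]}


def _column(item, row, width):
    """Pick the counter column for item in the given row (row index is the salt)."""
    digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % width


def cms_insert(sketch, item, count=1):
    """Add count to one counter per row."""
    for row in range(sketch["depth"]):
        sketch["rows"][row][_column(item, row, sketch["width"])] += count
    return sketch


def cms_estimate(sketch, item):
    """Return the minimum counter across rows; it never underestimates."""
    return min(sketch["rows"][row][_column(item, row, sketch["width"])]
               for row in range(sketch["depth"]))


def cms_merge(a, b):
    """Merge two sketches of identical shape by adding counters element-wise."""
    if (a["width"], a["depth"]) != (b["width"], b["depth"]):
        raise ValueError("sketches must have the same width and depth")
    rows = [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a["rows"], b["rows"])]
    return {"width": a["width"], "depth": a["depth"], "rows": rows}


# Two workers summarize their own partitions of the stream...
left, right = cms_create(), cms_create()
for item in ["a", "b", "a"]:
    cms_insert(left, item)
for item in ["a", "c"]:
    cms_insert(right, item)

# ...and the partial summaries are combined afterwards.
merged = cms_merge(left, right)

# Plain data structures round-trip through a generic serializer unchanged.
restored = json.loads(json.dumps(merged))
```

Because merging is just element-wise addition, the merged sketch is identical to one built from the whole stream in a single pass, which is what makes this kind of summary convenient in a distributed setting. And because the state is nothing more than numbers in nested lists, no custom serialization code is needed.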