Wednesday, September 3, 2014

Sketching/hashing Algorithms in Clojure

Just a short note that I (and BigML) have open-sourced a library of hashing/sketching-based stream summarizers for Clojure.

Specifically, the library includes techniques that take streams of items and return summaries that can be queried for set membership (bloom filters), set similarity (min-hashes), item occurrence counts (count-min sketches), and the number of distinct items (with my favorite, the magical HyperLogLog).
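To make the set-membership case concrete, here is a minimal Bloom filter sketch. It is written in Python for illustration only and is not the library's API; the class name, the default sizes, and the use of salted SHA-256 digests as hash functions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: each item sets k positions in an m-slot bit array.

    Hypothetical illustration; names and defaults are not from the library.
    """

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k positions by salting a SHA-256 digest with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))
```

An item that was added always answers True. An item that was never added usually answers False, but can answer True when its positions collide with bits set by other items, which is the price paid for the fixed memory footprint.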

This library was largely an educational exercise for me, as I wanted to better understand the world of streaming summaries for categorical data. It's written in almost pure Clojure and backed by plain Clojure data structures, so it's (hopefully!) easy to use and easy to serialize. All the summaries are merge-friendly, making them a nice fit for distributed settings. The big caveat is that I didn't spend much effort optimizing for speed. Those who need to squeeze out every CPU cycle may need to look elsewhere.
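As a rough illustration of what "merge-friendly" means, the sketch below implements a small count-min sketch whose merge is just an element-wise sum of counters. It is Python rather than Clojure and does not reflect the library's actual API; the class, its parameters, and the salted SHA-256 hashing are assumptions for the example.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch: depth rows of width counters (hypothetical names)."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One column per row, chosen by a digest salted with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.rows[row][col] for row, col in self._columns(item))

    def merge(self, other):
        # Sketches built with the same dimensions and hashes merge by addition.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        merged = CountMinSketch(self.width, self.depth)
        merged.rows = [
            [a + b for a, b in zip(r1, r2)]
            for r1, r2 in zip(self.rows, other.rows)
        ]
        return merged
```

Merging two sketches built on separate machines yields the same counters as one sketch fed both streams, which is why this kind of summary suits distributed settings.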

Thursday, February 21, 2013

Streaming histograms... faster

I just pushed v3.1.0 of the BigML streaming histogram library. For those stumbling across this blog, the library provides an embellished implementation of the histograms introduced by Ben-Haim in JMLR. It's a handy way to compress a stream of data using limited memory. The algorithm is online/streaming, so it only needs one pass over the data and it can always give you an estimate of the distribution so far.

I recently noticed that Hive includes a simple implementation. That makes sense. The histograms are merge friendly, so they're a good tool for distributed systems. But Hive's version was much faster than mine for small histograms. That bugged me. :P

Hive uses a simple array to back the histogram. That's bad for histograms with lots of bins, since inserting a point costs O(n) with respect to the number of bins in the histogram. My histograms had better time complexity, O(log n), but also more overhead. Version 3.1.0 includes both approaches and switches between them depending on your histogram.
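For a feel of the array-backed approach, here is a minimal sketch of a streaming histogram that keeps a sorted list of bins and merges the two closest bins whenever the bin limit is exceeded. It is Python written for illustration, not the code from the BigML library or from Hive; the class name and the `max_bins` parameter are assumptions.

```python
import bisect


class ArrayHistogram:
    """Toy array-backed streaming histogram (hypothetical illustration)."""

    def __init__(self, max_bins=4):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [mean, count]

    def insert(self, value):
        # Building the key list and inserting into the list are both O(n)
        # in the number of bins, which is the cost discussed above.
        means = [b[0] for b in self.bins]
        idx = bisect.bisect_left(means, value)
        if idx < len(self.bins) and self.bins[idx][0] == value:
            self.bins[idx][1] += 1
            return
        self.bins.insert(idx, [value, 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Find the adjacent pair with the smallest gap between means.
        i = min(
            range(len(self.bins) - 1),
            key=lambda j: self.bins[j + 1][0] - self.bins[j][0],
        )
        (m1, c1), (m2, c2) = self.bins[i], self.bins[i + 1]
        total = c1 + c2
        # Replace the pair with one bin at the count-weighted mean.
        self.bins[i:i + 2] = [[(m1 * c1 + m2 * c2) / total, total]]
```

With only a handful of bins the linear scan is cheap and has little overhead, which is consistent with an array beating a tree on small histograms. A tree-backed variant would trade that simplicity for O(log n) inserts.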

Anyway, it's a Java library but with lots of bells and whistles for Clojure devs. If you deal with streaming data on the JVM, give it a look!

Saturday, January 26, 2013

My God, it's full of stars... and ClojureScript

I just open-sourced a portion of a perpetually half-finished ClojureScript game. The project uses ClojureScript, canvas, and affine transforms to build an interactive star map. The star map supports panning and zooming (click and drag to pan, scroll in or out to zoom).
The notable bits of the codebase are the canvas namespace and the affine transform namespace. My implementation of the star map was borrowed from an earlier incarnation I once did in Java 2D. I think there's an opportunity for a nice ClojureScript/Clojure library that would accept the same 2D drawing and transform operations for either a browser canvas element or a Java Graphics object.
To see the map in action:
To see the project page:
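The panning and zooming described above can be expressed as composition of 2D affine transforms. The sketch below is Python for illustration and is not taken from the project's affine transform namespace; the function names and the 3x3 row-major matrix representation are assumptions.

```python
def translate(dx, dy):
    """3x3 affine matrix that shifts points by (dx, dy)."""
    return ((1.0, 0.0, dx), (0.0, 1.0, dy), (0.0, 0.0, 1.0))


def scale(s):
    """3x3 affine matrix that scales uniformly about the origin."""
    return ((s, 0.0, 0.0), (0.0, s, 0.0), (0.0, 0.0, 1.0))


def compose(a, b):
    """Matrix product a * b: apply b first, then a."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def apply(m, point):
    """Transform an (x, y) point by affine matrix m."""
    x, y = point
    return (
        m[0][0] * x + m[0][1] * y + m[0][2],
        m[1][0] * x + m[1][1] * y + m[1][2],
    )


def pan(view, dx, dy):
    """Shift the current view by a drag of (dx, dy) screen pixels."""
    return compose(translate(dx, dy), view)


def zoom_about(view, factor, cursor):
    """Zoom the view by factor while keeping the screen point cursor fixed."""
    cx, cy = cursor
    zoom = compose(
        translate(cx, cy),
        compose(scale(factor), translate(-cx, -cy)),
    )
    return compose(zoom, view)
```

Zooming about the cursor moves that point to the origin, scales, and moves it back, so whatever sits under the mouse stays put while everything else spreads out or contracts around it.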

Tuesday, January 22, 2013

Clojure goes to Washington

Congressional Partisanship (by roll-call votes)
A year or two ago I created a project for tracking congressional partisanship over time using roll-call votes. More recently I rebuilt the project using Clojure. Even more recently I added a page to explain and show off the results.

I've open-sourced it for whoever is interested. There's some code that uses Enlive to scrape official vote data for the House and Senate. There's also a bit of code for transforming the raw data into interesting metrics. And finally, there's a process that transforms the daily metrics into a moving average and exports it to a dygraph-friendly format.
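The smoothing step can be sketched as a trailing simple moving average. This is a Python illustration under assumptions, not the project's code: the function name and the `window` parameter are hypothetical, and the real project may use a different kind of average or window.

```python
from collections import deque


def moving_average(values, window):
    """Trailing simple moving average over the last `window` values.

    Early outputs average over however many values have been seen so far.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque()
    total = 0.0
    averages = []
    for value in values:
        recent.append(value)
        total += value
        if len(recent) > window:
            # Drop the oldest value so the running sum covers one window.
            total -= recent.popleft()
        averages.append(total / len(recent))
    return averages
```

For example, `moving_average([1, 2, 3, 4], 2)` returns `[1.0, 1.5, 2.5, 3.5]`. A wider window smooths day-to-day noise at the cost of reacting more slowly to real shifts.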

All in all, it was a fun little project that gave me the chance to toy with a few new (at least to me) libraries and practice my Clojure. It also confirmed the (perhaps obvious) observation that congressional politics has been remarkably nasty in recent years.