Tuesday, September 6, 2016


Is this thing still on? Just a tiny post pointing to my most recent side project, PDX Viz. It's a way to map and visualize cycling, pedestrian, and transit connectivity around Portland.

Friday, January 16, 2015

cljx-sampling: A Clojure(script) library for sampling and random numbers

For a current hobby project I need the ability to generate seeded random numbers and/or sample items from collections in either the JVM or the browser. The PPRNG lib offers seedable random numbers, but it uses different generators depending on the environment. Given a seed, I want to generate the same sequence of numbers regardless of where the code is running.

So I've open-sourced a little library, cljx-sampling, that uses a seedable 32-bit Xorshift random number generator for consistent results in both Clojure and ClojureScript. I also reused some of my code from bigml/sampling and combined it with the Xorshift RNG to allow for convenient (and still consistent) in-memory samples over collections. Maybe you'll find it useful?
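
For the curious, here is a minimal sketch of what a seedable 32-bit Xorshift generator can look like, written in Java. It is not cljx-sampling's code: the class name, the 13/17/5 shift triple (one of Marsaglia's published choices), and the fallback for a zero seed are all assumptions made for illustration. The appeal for cross-platform use is that every step is plain 32-bit shift-and-xor arithmetic, which JavaScript's bitwise operators can reproduce exactly.

```java
// Illustrative only: a hypothetical class, not part of cljx-sampling.
final class Xorshift32 {
    private int state;

    Xorshift32(int seed) {
        // Xorshift can never leave the all-zero state, so swap in an
        // arbitrary nonzero constant when the seed is 0 (an assumption).
        this.state = (seed != 0) ? seed : 0x9E3779B9;
    }

    /** Advances the generator one step and returns all 32 bits. */
    int nextInt() {
        int x = state;
        x ^= x << 13;   // left shift
        x ^= x >>> 17;  // unsigned right shift keeps the math sign-independent
        x ^= x << 5;
        state = x;
        return x;
    }

    /** Uniform double in [0, 1), treating the 32 bits as unsigned. */
    double nextDouble() {
        return (nextInt() & 0xFFFFFFFFL) / 4294967296.0;
    }
}
```

Two generators built from the same seed walk through the same sequence, which is the property the library is after.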


Wednesday, September 3, 2014

Sketching/hashing Algorithms in Clojure

Just a short note that I (and BigML) have open-sourced a library of hashing/sketching-based stream summarizers for Clojure.

Specifically, the library includes techniques that take streams of items and return summaries that can be queried for set membership (bloom filters), set similarity (min-hashes), item occurrence counts (count-min sketches), and the number of distinct items (with my favorite, the magical HyperLogLog).

This library was largely an educational exercise for me, as I wanted to better understand the world of streaming summaries for categorical data. It's written in almost pure Clojure and backed by plain Clojure data structures. So it's (hopefully!) easy to use and easy to serialize. All the summaries are merge-friendly, making them a nice fit for distributed settings. The big caveat is that I didn't spend much effort optimizing for speed. Those who need to squeeze the most out of every CPU cycle may need to look elsewhere.
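
To make "merge-friendly" concrete, here is a rough Java sketch of the simplest summary in the list, a Bloom filter. None of this is the library's API: the class, its method names, and the hashing scheme (two FNV-1a-style hashes combined by double hashing) are assumptions picked to keep the example short. The point is that two filters built with the same parameters merge with a bitwise OR, so each worker can summarize its own slice of a stream and the results can be combined later.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Illustrative only: a hypothetical class, not the Clojure library's API.
final class BloomFilter {
    private final int numBits;
    private final int numHashes;
    private final BitSet bits;

    BloomFilter(int numBits, int numHashes) {
        if (numBits < 1 || numHashes < 1) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // FNV-1a-style hash over the UTF-8 bytes. The starting value is a
    // parameter so one routine can supply two different hashes.
    private static int fnv1a(String item, int start) {
        int h = start;
        for (byte b : item.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return h;
    }

    // i-th bit position via double hashing: h1 + i * h2, wrapped into range.
    private int index(String item, int i) {
        int h1 = fnv1a(item, 0x811C9DC5);
        int h2 = fnv1a(item, 0x01000193);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void insert(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(item, i));
        }
    }

    /** False means definitely absent; true means "probably present". */
    boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    /** Returns a new filter equivalent to having inserted both streams. */
    BloomFilter merge(BloomFilter other) {
        if (numBits != other.numBits || numHashes != other.numHashes) {
            throw new IllegalArgumentException("filters must share parameters");
        }
        BloomFilter merged = new BloomFilter(numBits, numHashes);
        merged.bits.or(this.bits);
        merged.bits.or(other.bits);
        return merged;
    }
}
```

Merging never introduces false negatives: anything either filter had seen is still reported as possibly present in the merged one.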

Thursday, February 21, 2013

Streaming histograms... faster

I just pushed v3.1.0 of the BigML streaming histogram library. For those stumbling across this blog, the library provides an embellished implementation of the histograms introduced by Ben-Haim and Tom-Tov in JMLR. It's a handy way to compress a stream of data using limited memory. The algorithm is online/streaming, so it only needs one pass over the data and it can always give you an estimate of the distribution so far.
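
If you haven't seen these histograms before, the core update step is small enough to sketch. The Java below is a bare-bones illustration and not the BigML library's code or API: the class and method names are made up, bins are plain {mean, count} pairs in a sorted list, and none of the library's embellishments are included. Each new point becomes its own bin, and whenever the bin budget is exceeded, the two bins with the closest means are merged into one count-weighted bin.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical class, not the BigML histogram API.
final class SimpleHistogram {
    // Each bin is {mean, count}; the list stays sorted by mean.
    private final List<double[]> bins = new ArrayList<>();
    private final int maxBins;

    SimpleHistogram(int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be positive");
        }
        this.maxBins = maxBins;
    }

    void insert(double point) {
        // Linear scan for the insertion position.
        int i = 0;
        while (i < bins.size() && bins.get(i)[0] < point) {
            i++;
        }
        if (i < bins.size() && bins.get(i)[0] == point) {
            bins.get(i)[1] += 1;                      // exact match: bump the count
        } else {
            bins.add(i, new double[] {point, 1});     // otherwise add a new bin
        }
        if (bins.size() > maxBins) {
            mergeClosest();
        }
    }

    // Replaces the two adjacent bins with the smallest gap between means
    // by a single bin at their count-weighted mean.
    private void mergeClosest() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1)[0] - bins.get(i)[0];
            if (gap < bestGap) {
                bestGap = gap;
                best = i;
            }
        }
        double[] a = bins.get(best);
        double[] b = bins.get(best + 1);
        double count = a[1] + b[1];
        a[0] = (a[0] * a[1] + b[0] * b[1]) / count;
        a[1] = count;
        bins.remove(best + 1);
    }

    int size() { return bins.size(); }
    double mean(int i) { return bins.get(i)[0]; }
    double count(int i) { return bins.get(i)[1]; }
}
```

With a budget of two bins, inserting 1, 2, and 10 leaves a bin at mean 1.5 with count 2 and a bin at mean 10 with count 1. Memory never grows past the bin budget, no matter how long the stream runs.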

I recently noticed that Hive includes a simple implementation. That makes sense. The histograms are merge-friendly, so they're a good tool for distributed systems. But Hive's version was much faster than mine for small histograms. That bugged me. :P

Hive uses a simple array to back the histogram. That's bad for histograms with lots of bins, since inserting a point costs O(n), where n is the number of bins in the histogram. My histograms had better time complexity, O(log n), but also more overhead. Version 3.1.0 includes both approaches and switches between them depending on your histogram.

Anyway, it's a Java library but with lots of bells and whistles for Clojure devs. If you deal with streaming data on the JVM, give it a look!