Showing posts with label clojure. Show all posts


Friday, January 16, 2015

cljx-sampling: A Clojure(script) library for sampling and random numbers

For a current hobby project I need the ability to generate seeded random numbers and/or sample items from collections in either the JVM or the browser. The PPRNG lib offers seedable random numbers, but it uses different generators depending on the environment. Given a seed, I want to generate the same sequence of numbers regardless of where the code is running.

So I've open-sourced a little library, cljx-sampling, that uses a seedable 32-bit Xorshift random number generator for consistent results in both Clojure and Clojurescript. I also reused some of my code from bigml/sampling and combined it with the Xorshift RNG to allow for convenient (and still consistent) in-memory samples over collections. Maybe you'll find it useful?

https://github.com/ashenfad/cljx-sampling
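For the curious, the core of a 32-bit Xorshift generator is tiny. Here's a sketch of the classic Marsaglia xorshift32 step in Clojure; it's illustrative only, and the shift constants and state handling in cljx-sampling may differ:

```clojure
;; The classic Marsaglia xorshift32 step, masked to 32 bits so it
;; behaves the same on any host. Illustrative only; not necessarily
;; the exact generator cljx-sampling uses.
(defn xorshift32
  "Advances a 32-bit Xorshift state. The seed must be non-zero."
  [x]
  (let [mask 0xFFFFFFFF
        x (bit-and mask (bit-xor x (bit-shift-left x 13)))
        x (bit-and mask (bit-xor x (unsigned-bit-shift-right x 17)))
        x (bit-and mask (bit-xor x (bit-shift-left x 5)))]
    x))

(defn random-seq
  "A lazy stream of pseudo-random 32-bit values from a seed."
  [seed]
  (rest (iterate xorshift32 seed)))
```

Because everything is plain arithmetic on masked values, the same seed yields the same sequence on the JVM and in the browser.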

Wednesday, September 3, 2014

Sketching/hashing Algorithms in Clojure

Just a short note that I (and BigML) have open sourced a library of hashing/sketching-based stream summarizers for Clojure.

Specifically, the library includes techniques that take streams of items and return summaries that can be queried for set membership (bloom filters), set similarity (min-hashes), item occurrence counts (count-min sketches), and the number of distinct items (with my favorite, the magical HyperLogLog).

This library was largely an educational exercise for me, as I wanted to better understand the world of streaming summaries for categorical data. It's written in almost pure Clojure and backed by plain Clojure data structures, so it's (hopefully!) easy to use and easy to serialize. All the summaries are merge-friendly, making them a nice fit for distributed settings. The big caveat is that I didn't spend much effort optimizing for speed; those in need of maximizing every CPU cycle may need to look elsewhere.
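To give a flavor of how these summaries work, here's a toy bloom filter in plain Clojure data structures. The bit width, hash count, and function names are mine for illustration; this is not the library's API:

```clojure
;; A toy bloom filter: the "summary" is just a set of bit positions.
;; Parameters and names are illustrative, not the library's API.
(def bits 1024)  ; size of the (implicit) bit array
(def hashes 4)   ; number of hash functions

(defn- indices
  "The bit positions an item maps to, via salted hashing."
  [item]
  (map #(mod (hash [% item]) bits) (range hashes)))

(defn bloom-add [bloom item]
  (into bloom (indices item)))

(defn bloom-contains?
  "False means definitely absent; true means probably present."
  [bloom item]
  (every? bloom (indices item)))

;; Merging two filters is just set union, which is why these
;; summaries play so nicely in distributed settings:
(defn bloom-merge [a b]
  (into a b))
```

Since the filter is an ordinary Clojure set, it serializes and merges for free.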

Thursday, February 21, 2013

Streaming histograms... faster

I just pushed v3.1.0 of the BigML streaming histogram library. For those stumbling across this blog, the library provides an embellished implementation of the histograms introduced by Ben-Haim and Tom-Tov in JMLR. It's a handy way to compress a stream of data using limited memory. The algorithm is online/streaming, so it only needs one pass over the data, and it can always give you an estimate of the distribution so far.

I recently noticed that Hive includes a simple implementation. That makes sense. The histograms are merge friendly, so they're a good tool for distributed systems. But Hive's version was much faster than mine for small histograms. That bugged me. :P

Hive uses a simple array to back the histogram. That's bad for histograms with lots of bins, since inserting a point costs O(n) with respect to the number of bins in the histogram. My histograms had a better time order, O(log n), but also more overhead. Version 3.1.0 includes both approaches and switches between them depending on your histogram.

Anyway, it's a Java library but with lots of bells and whistles for Clojure devs. If you deal with streaming data on the JVM, give it a look!


Saturday, January 26, 2013

My God, it's full of stars... and ClojureScript



I just open sourced a portion of a perpetually half-finished ClojureScript game.  The project uses ClojureScript, canvas, and affine transforms to build an interactive star map. The star map supports panning and zooming (click and drag to pan, scroll in or out to zoom).
The notable bits of the codebase are the canvas namespace and the affine transform namespace. My implementation of the star map was borrowed from an earlier incarnation I once did in Java 2D. I think there's an opportunity for a nice ClojureScript/Clojure library that would accept the same 2D drawing and transform operations for either a browser canvas element or a Java Graphics object.
Anyway...
To see the map in action: http://ashenfad.github.com/stars/
To see the project page: https://github.com/ashenfad/stars

Tuesday, January 22, 2013

Clojure goes to Washington

Congressional Partisanship (by roll-call votes)
A year or two ago I created a project for tracking congressional partisanship over time using roll-call votes. More recently I rebuilt the project using Clojure. Even more recently I added a page to explain and show off the results.

I've open sourced it for whoever is interested. There's some code that uses Enlive to scrape official vote data for the House and Senate. There's also a bit of code for transforming the raw data into interesting metrics. And finally, there's a process that transforms the daily metrics into a moving average and exports it to a dygraphs-friendly format.
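The moving-average step is the simplest part. Assuming a seq of daily metric values, it can be sketched in a couple of lines (the function name is mine, not the project's):

```clojure
;; A simple sliding-window moving average over daily metric values.
;; The name and window handling are illustrative, not the project's code.
(defn moving-avg
  "Averages xs over a sliding window of the given size."
  [window xs]
  (map #(/ (reduce + %) (count %))
       (partition window 1 xs)))

(moving-avg 3 [1 2 3 4 5]) ;=> (2 3 4)
```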

All in all, it was a fun little project that gave me the chance to toy with a few new (at least to me) libraries and practice my Clojure. It also confirmed the (perhaps obvious) observation that congressional politics has been remarkably nasty in recent years.


Monday, January 21, 2013

(+ clojure big-ml)

I've been working on a Clojure library for the BigML machine learning API and we recently released it into the wild.


Thursday, June 21, 2012

Reservoir Sampling in Clojure

Update

I now have a whole fancy library for random sampling in Clojure, including reservoir sampling with lots of extras (with and without replacement, weighting, seeds). It's available here:
https://github.com/bigmlcom/sampling

A little snippet of Clojure for reservoir sampling:
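The embedded snippet didn't survive the move, so here's a reconstruction of the same idea (the classic Algorithm R), not the exact code from the post:

```clojure
;; Reservoir sampling (Algorithm R): keep the first k items, then
;; replace a random slot with decreasing probability. A reconstruction
;; of the idea, not the post's original snippet.
(defn reservoir-sample
  "Returns k items sampled uniformly (without replacement) from coll
   in a single pass."
  [k coll]
  (reduce (fn [reservoir [i item]]
            (if (< i k)
              (conj reservoir item)
              (let [j (rand-int (inc i))]
                (if (< j k)
                  (assoc reservoir j item)
                  reservoir))))
          []
          (map-indexed vector coll)))
```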

Friday, June 10, 2011

Streaming (one-pass) Sampling With Replacement

Update

I now have a whole fancy library for random sampling in Clojure, including stream sampling and reservoir sampling with lots of extras (with and without replacement, weighting, seeds). It's available here:
https://github.com/bigmlcom/sampling


Recently I've been looking at sampling techniques for very large or streaming datasets. Specifically, I needed an algorithm that could perform random sampling with replacement during a single pass over a dataset too large to fit into memory.

A bit of searching led me to a method that accomplishes just that using a dynamic sample reservoir. However, I wanted the ability to generate very large samples. Just like the original dataset, the sample might not fit into memory. So reservoirs were out.

Fortunately my problem had a simplification that made things much easier.  The reservoir approach assumed that the size of the original dataset was unknown. Not true in my case. Before I start the sample I'll know whether the dataset has a million instances or a billion.

This piece of information lets me cheat. Given the overall population size (the number of instances in the original dataset) and the intended sample size, I can calculate the occurrence probabilities for a single instance. I can find how likely an instance is to be sampled once, twice, three times, etc.


Once I have that probability distribution I can iterate over the original dataset. For each instance I just roll the die to determine how many times to sample it.

To calculate the occurrence probabilities, I use the following equation. Let $x$ be the number of occurrences, $l$ be the sample size, and $n$ the original population size:

$$P(x) = \binom{l}{x} \left(\frac{1}{n}\right)^{x} \left(1 - \frac{1}{n}\right)^{l-x}$$

In essence, this equation takes the probability of an instance being selected $x$ times, not selected $(l-x)$ times, and multiplies it all by the total number of ways this could happen.

To build a full distribution I calculate the probability of each occurrence starting at 0 and going up until I've captured nearly all the probability mass. This seems to work quite nicely, even for wacky situations like over-sampling the original population.


These are the strong points of this method:
  • Single pass sampling with replacement for any size dataset without any significant memory requirements
  • The original dataset can be split into chunks, sampled separately, and then recombined (perfect for map-reduce)
  • More than one sample set could be built in the same pass over the original data

But with these caveats:
  • Must know the size of the original dataset
  • The final size of the sample set will be near the intended size - but will be a little bigger or smaller depending on lady luck
  • The order of the sampled data won't be randomized

And finally, my Clojure implementation of this technique:
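The original gist is no longer embedded, so here's a sketch of the approach described above; the function names are mine:

```clojure
;; One-pass sampling with replacement, assuming the population size
;; is known up front. A sketch of the technique above, not the
;; original gist.
(defn occurrence-probs
  "Probabilities that a single instance appears 0, 1, 2, ... times in
   a sample of size l drawn with replacement from a population of
   size n. Truncated once nearly all the mass is captured."
  [l n]
  (let [p (/ 1.0 n)
        choose (fn [l x]
                 (reduce (fn [acc i] (* acc (/ (- l (dec i)) i)))
                         1.0 (range 1 (inc x))))]
    (loop [x 0, acc [], mass 0.0]
      (let [px (* (choose l x)
                  (Math/pow p x)
                  (Math/pow (- 1.0 p) (- l x)))
            acc (conj acc px)
            mass (+ mass px)]
        (if (or (> mass 0.999999) (>= x l))
          acc
          (recur (inc x) acc mass))))))

(defn sample-occurrences
  "Rolls the die: picks how many times to emit an instance."
  [probs]
  (let [r (rand)]
    (loop [x 0, cum 0.0]
      (let [cum (+ cum (nth probs x))]
        (if (or (< r cum) (= x (dec (count probs))))
          x
          (recur (inc x) cum))))))

(defn sample-with-replacement
  "A single pass over coll, emitting each item 0 or more times."
  [coll sample-size]
  (let [probs (occurrence-probs sample-size (count coll))]
    (mapcat #(repeat (sample-occurrences probs) %) coll)))
```

Since each instance is handled independently, chunks of the dataset can be sampled separately and concatenated, exactly as the map-reduce bullet above describes.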

Friday, May 13, 2011

Another Clojure snippet

In my continuing adventure learning Clojure, I've put together a bit of code that takes a collection of strings and filters out any string that is contained within another string in the collection.

To whoever is interested:
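The snippet itself didn't survive the move, so here's a reconstruction of the idea (the names are mine):

```clojure
;; Removes any string that appears as a substring of a *different*
;; string in the collection. A reconstruction, not the original gist.
(require '[clojure.string :as str])

(defn remove-contained
  [strings]
  (remove (fn [s]
            (some #(and (not= s %)
                        (str/includes? % s))
                  strings))
          strings))
```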

Saturday, May 7, 2011

Risk Roll'n (with Clojure)

Once upon a time my frequent and unlucky defeats while playing Risk led me to build a super-simple Monte Carlo-style simulator for Risk battles. It didn't help me win any more often, but at least I could complain about my losses by showing just how unlikely they were. It was a consolation prize of sorts.

Recently I've been learning Clojure with the intent of trying out Cascalog for simplifying Hadoop workflows. It's been ages since I've programmed in a functional language, so I wanted a bite-sized project to help get comfy with Clojure. The Risk battle simulator seemed like a good fit - so here it is, rebuilt with Clojure-y goodness.


Like I said, I'm new to Clojure, so don't assume any of that is idiomatic. Nonetheless, the code is pretty concise compared to my Java implementation and the slick Incanter library made it easy to chart the distribution of outcomes for a battle.

Running "(view-outcomes {:attackers 10 :defenders 10} 1000000)" will compute and chart the outcomes of a million 10-vs-10 Risk battles. The magnitude on the outcome axis indicates how many armies are left; the negative range means the defender wins, and the positive range means the attacker wins.
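For anyone who wants the gist without digging through the repo, the core battle logic can be sketched like this. It's a reconstruction from the standard Risk rules, not the project's exact code:

```clojure
;; One Risk battle, simulated from the standard rules: the attacker
;; rolls up to 3 dice (one army must stay behind), the defender up to
;; 2, the highest dice are compared pairwise, and ties go to the
;; defender. A reconstruction, not the project's exact code.
(defn- roll [n] (repeatedly n #(inc (rand-int 6))))

(defn battle-round
  "Returns [attacker-losses defender-losses] for one round of dice."
  [attackers defenders]
  (let [a (sort > (roll (min 3 (dec attackers))))
        d (sort > (roll (min 2 defenders)))]
    (reduce (fn [[al dl] [av dv]]
              (if (> av dv) [al (inc dl)] [(inc al) dl]))
            [0 0]
            (map vector a d))))

(defn battle
  "Fights until one side is finished. Returns the signed outcome:
   positive means the attacker wins with that many armies left,
   negative means the defender wins."
  [attackers defenders]
  (if (or (<= attackers 1) (zero? defenders))
    (if (zero? defenders) attackers (- defenders))
    (let [[al dl] (battle-round attackers defenders)]
      (recur (- attackers al) (- defenders dl)))))
```

Repeating (battle 10 10) a million times and binning the results gives the distribution the chart shows.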


So the next time your 20 armies are destroyed by 12 measly defenders, you can point at this chart and complain with accuracy.