Saturday, July 23, 2011

Genetic Gerrymandering - Oregon

So I automated gerrymandering. It's not the most noble side-project I've worked on. But it's interesting how a few simple rules can emulate devious human behavior. Read on to find out how. :)

Every ten years the US Census Bureau counts up the nation's population. With these numbers congressional seats are reallocated among states and each state redraws the boundaries of their congressional districts. All this is to ensure that citizens have, more or less, equal representation in the House of Representatives.

Maintaining equal representation is a good thing, but state legislatures sometimes take advantage of it to ensure the party in power stays in power. This process, called gerrymandering, is accomplished through "packing and cracking" populations so that one party will win a disproportionate number of elections (compared to their actual public support).

My long-term goal is to grade and rank each state's districting objectively (well, objectively according to criteria that I'll make up *shrug*). I decided, for my first step, to try my own hand at gerrymandering. I picked my home state of Oregon as the guinea pig.

Friday, June 10, 2011

Streaming (one-pass) Sampling With Replacement

Update

I now have a whole fancy library for random sampling in Clojure, including stream sampling and reservoir sampling with lots of extras (with and without replacement, weighting, seeds). It's available here:
https://github.com/bigmlcom/sampling


Recently I've been looking at sampling techniques for very large or streaming datasets. Specifically, I needed an algorithm that could perform random sampling with replacement during a single pass over a dataset too large to fit into memory.

A bit of searching led me to a method that accomplishes just that using a dynamic sample reservoir. However, I wanted the ability to generate very large samples. Just like the original dataset, the sample might not fit into memory, so reservoirs were out.

Fortunately, my problem had a simplification that made things much easier. The reservoir approach assumes that the size of the original dataset is unknown. That's not true in my case: before I start the sample I'll know whether the dataset has a million instances or a billion.

This piece of information lets me cheat. Given the overall population size (the number of instances in the original dataset) and the intended sample size, I can calculate the occurrence probabilities for a single instance. I can find how likely an instance is to be sampled once, twice, three times, etc.


Once I have that probability distribution I can iterate over the original dataset. For each instance I just roll the die to determine how many times to sample it.

To calculate the occurrence probabilities, I use the following equation. Let x be the number of occurrences, l be the sample size, and n the original population size:

P(x) = C(l, x) * (1/n)^x * (1 - 1/n)^(l-x)

In essence, this equation takes the probability of an instance being selected x times and not selected (l-x) times, and multiplies it all by the total number of ways this could happen, C(l, x).

To build a full distribution I calculate the probability of each occurrence count, starting at 0 and going up until I've captured nearly all the probability mass. This seems to work quite nicely, even for wacky situations like over-sampling the original population.


These are the strong points of this method:
  • Single pass sampling with replacement for any size dataset without any significant memory requirements
  • The original dataset can be split into chunks, sampled separately, and then recombined (perfect for map-reduce)
  • More than one sample set could be built in the same pass over the original data

But with these caveats:
  • Must know the size of the original dataset
  • The final size of the sample set will be near the intended size - but will be a little bigger or smaller depending on lady luck
  • The order of the sampled data won't be randomized

And finally, my Clojure implementation of this technique:
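The embedded gist may not come through here, so here's a sketch of the idea. The function names (occurrence-dist, sample-with-replacement, etc.) are illustrative rather than the exact code:

```clojure
(defn- choose
  "Binomial coefficient C(n, k), computed incrementally with exact ints."
  [n k]
  (reduce (fn [acc i] (/ (* acc (inc (- n i))) i))
          1
          (range 1 (inc k))))

(defn occurrence-prob
  "Probability that a single instance appears exactly x times in a
   sample of size l drawn with replacement from a population of size n."
  [n l x]
  (let [p (/ 1.0 n)]
    (* (choose l x) (Math/pow p x) (Math/pow (- 1.0 p) (- l x)))))

(defn occurrence-dist
  "Cumulative distribution over occurrence counts, built up from 0 until
   nearly all the probability mass is captured."
  [n l]
  (loop [x 0, mass 0.0, dist []]
    (let [mass (+ mass (occurrence-prob n l x))
          dist (conj dist [mass x])]
      (if (or (>= x l) (> mass (- 1.0 1e-9)))
        dist
        (recur (inc x) mass dist)))))

(defn- roll
  "Pick an occurrence count by inverting the cumulative distribution."
  [dist]
  (let [r (rand)]
    (or (some (fn [[mass x]] (when (< r mass) x)) dist)
        (second (peek dist)))))

(defn sample-with-replacement
  "Single pass over coll (assumed to hold n items), emitting each item
   as many times as it's sampled for a target sample size of l."
  [coll n l]
  (let [dist (occurrence-dist n l)]
    (mapcat #(repeat (roll dist) %) coll)))
```

Because every instance is handled independently, chunks of the dataset can be run through sample-with-replacement separately (sharing the same n and l) and concatenated afterwards, which is what makes the map-reduce point above work.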

Friday, May 13, 2011

Another Clojure snippet

In my continuing adventure learning Clojure, I've put together a bit of code that takes a collection of strings and filters any that are contained in another string.

To whoever is interested:
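The embedded snippet may not display here; the gist of it is something like this (the name is mine):

```clojure
(defn remove-contained
  "Remove any string that appears as a substring of another
   string in the collection."
  [strings]
  (remove (fn [s]
            (some #(and (not= s %) (.contains ^String % s)) strings))
          strings))

;; (remove-contained ["foo" "foobar" "baz"]) => ("foobar" "baz")
```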

Saturday, May 7, 2011

Risk Roll'n (with Clojure)

Once upon a time my frequent and unlucky defeats while playing Risk led me to build a super simple Monte Carlo-style simulator for Risk battles. It didn't help me win any more often, but at least I could complain about my losses by showing just how unlikely they were. It was a consolation prize of sorts.

Recently I've been learning Clojure with the intent of trying out Cascalog for simplifying Hadoop workflows. It's been ages since I've programmed in a functional language, so I wanted a bite sized project to help get comfy with Clojure. The Risk battle simulator seemed like a good fit - so here it is rebuilt with Clojure-y goodness.


Like I said, I'm new to Clojure, so don't assume any of that is idiomatic. Nonetheless, the code is pretty concise compared to my Java implementation and the slick Incanter library made it easy to chart the distribution of outcomes for a battle.

Running "(view-outcomes {:attackers 10 :defenders 10} 1000000)" will compute and chart the outcomes of a million 10 vs. 10 Risk battles. The magnitude on the outcome axis indicates how many armies are left; the negative range means the defender wins and the positive range means the attacker wins.


So the next time your 20 armies are destroyed by 12 measly defenders, you can point at this chart and complain with accuracy: