Saturday, July 23, 2011

Genetic Gerrymandering - Oregon

So I automated gerrymandering. It's not the most noble side-project I've worked on. But it's interesting how a few simple rules can emulate devious human behavior. Read on to find out how. :)

Every ten years the US Census Bureau counts up the nation's population. With these numbers congressional seats are reallocated among states and each state redraws the boundaries of their congressional districts. All this is to ensure that citizens have, more or less, equal representation in the House of Representatives.

Maintaining equal representation is a good thing, but state legislatures sometimes take advantage of it to ensure the party in power stays in power. This process, called gerrymandering, is accomplished through "packing and cracking" populations so that one party will win a disproportionate number of elections (compared to their actual public support).

My long term goal is to grade and rank each state's districting objectively (well, objectively according to criteria that I'll make up *shrug*).  I decided, for my first step, to try my own hand at gerrymandering. I picked my home state of Oregon as the guinea pig.


The US Census Bureau counts up populations in small geographic areas called census tracts. These tracts are combined to create congressional districts. Fortunately the tract shapes are published online, so I grabbed the 756 Oregon tracts and converted them into a KML file shown above.

With the shape information I built a graph representation of Oregon's tracts. In the graph each node represents a tract and each edge represents a shared border between two tracts. However, not all shared borders qualify for an edge. A shared border must make up a significant portion of each tracts' overall border. The larger the tracts, the longer the required border. If I encounter any island tracts (not a problem for Oregon), I build the minimum spanning tree.

The image below shows the tract graph overlaid on the tracts for Corvallis, my home town. The dashed line is a shared border that wasn't large enough to make the cut.


With the tracts represented as a graph, creating a set of districts becomes a graph partitioning problem. The goal, though, is to partition it so that one political party wins the maximum number of districts. That meant I needed two more pieces of information, the population and the political affiliation for each census tract.

The US Census Bureau kindly makes populations for each tract publicly available. Darren (my geeky partner in crime) cleaned it up and posted it here. Unfortunately I ran my tests before it was ready. Instead I was lazy and made up my own population numbers. It's not too bad, though. The Census Bureau tries to put around 4000-5000 people in each tract. I set each tract's population near the upper part of that range.  My version of Oregon ends up with roughly 3.6 million residents. It's a decent, if not perfect approximation.

Getting the political affiliation is tricky. I had a naive optimism that the voting records for each tract might be public and easily accessible. Not so much. Public records exist for voting districts but those voting districts need to be mapped back to tracts. Even worse, the data is in ad hoc formats from county to county. Collecting it could be a full time job.

So I fudged it and made up my own political affiliations. Around Oregon, urban areas lean more Democratic than rural areas. I ran with that idea and used population density a proxy for political affiliation. I set the most urban tracts to have a 3-1 Democrat to Republican ratio, while the most rural tracts have a 1-3 ratio. That's pretty polarized but if Nate Silver is correct, Oregon may be the most polarized state in the nation.

The image below shows a map of the tracts in the Portland and Salem area shaded by political affiliation. As you'd expect, Democrat friendly tracts are blue, Republican leaning tracts are red, and mixed tracts are shades of purple. Looking at the map of the state, Oregon is mostly red with scattered but dense blue population centers. Overall I gave Oregon a 52%-48% lean for the Democrats, which seemed fair after our most recent gubernatorial race.


At this point I had all the information necessary to gerrymander Oregon. I just needed to settle on how. The rules were simple. I had 756 tracts that needed to be arranged into 5 districts (what I call a district set) and each district had to have about the same population. The more districts won by the target party (Dem or Repub) the better. Moreover, the bigger the margin of support in a won district the better.

Since calculating the optimal set of districts is a hard problem, I went for a heuristic search instead. Specifically, I used a genetic search. The basic idea is to start with some number of random, but valid, district sets. Take each district set and make some random small changes to the district borders. This creates a "child" district set. After generating some children for each parent district set, choose a few of the best ones. This is the "survival of the fittest" stage. Repeat the whole thing, generation after generation, until we get some solutions we're happy with.

To do a genetic search, I needed three things. First, I had to be able to generate random but valid district sets. Second, I needed a way to create small (but still valid) changes to a district set for creating children. Lastly, I needed a scoring function that would let me intelligently pick the which district sets should survive.

I'll skip the details on the first two tasks. That was just a bit hackery. The scoring function is a little more interesting (admittedly I might be the only person on the planet who thinks that). The simplest way to score would be to add up the number of winning districts. There are two problems with that, though. Margin of victory is important. A paltry 51% support for your party is unsafe come election time. Also, the genetic search does better when the scoring function is continuous rather than discrete. This way small changes can give slightly better scores and help guide our search to success.

To pull this off I used a sigmoid function. Specifically, I used the cumulative distribution function of a normal curve with a mean of 0.5 and a standard deviation of 0.05. It gives the nice S shaped curve below. The plot illustrates how the score increases as a district's party support increases (from a min of 0.25 to a max of 0.75).


That simple curve was all the genetic algorithm needed to discover the dubious human inventions of "packing and cracking". I ran two experiments, gerrymandering Oregon once for the Democrats and once for the Republicans. Each experiment took about six minutes on my laptop. That was time enough for the genetic search algorithm produce 100 generations and consider over 20,000 ways to partition the state.



District
Color
Pro Dem
1
67.7%
2
57.3%
3
35.6%
4
54.7%
5
57.3%

In this scenario, the Democrats take four of the five districts. District 3 is the Republican concession district. It covers the vast majority of state's area by swallowing up Oregon's most rural and red friendly tracts. District 2 covers most of the mid-Willamette Valley, turning Dem-friendly thanks to the population centers of Salem, Eugene/Springfield, and Albany/Corvallis. Districts 1, 4, and 5 each take a chunk of Portland's Democratic base to finish the Democratic dominance (see the next image). Those familiar with Oregon might notice that this fictitious set of districts has a quite a resemblance to Oregon's actual congressional districts. Coincidence? :P


In the second experiment, the state is divvied up to be Republican friendly. This time around the Republicans win four districts, leaving only one for the Democrats.


District
Color
Pro Repub
1
54.1%
2
53.9%
3
52.1%
4
25.9%
5
54.5%

In this scenario, district 4 is the only Democratic holdout. It's not even visible on the previous map - so let's zoom in:


District 4 jams in as many Portland Democrats as possible. This is textbook "packing", leaving the rest of Portland weak enough to divide and conquer by districts 5 and 2.

Not to be left out of the fun, districts 1 and 3 pull off a textbook "cracking" scenario by splitting up the Eugene/Springfield metro area.


So there we are - a fairly simple genetic search that can gerrymander nearly as well as us humans. Like I mentioned earlier, I'm hoping to morph this into a way to grade states based on their actual districting. Until then, this is really just a way to automate dirty politics - and I don't think any help is needed.