<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7717060942576029301</id><updated>2012-02-16T19:05:38.294-08:00</updated><category term='cloudera'/><category term='c4.5'/><category term='gists'/><category term='genetic algorithm'/><category term='map reduce'/><category term='clojure'/><category term='ec2'/><category term='whirr'/><category term='politics'/><category term='sampling with replacement'/><category term='topolitics'/><category term='streaming'/><category term='maverick'/><category term='partisanship'/><category term='ubuntu'/><category term='gerrymandering'/><category term='risk'/><category term='classification tree'/><category term='supervised learning'/><category term='monte carlo'/><category term='hadoop'/><title type='text'>Too Many NumLumps</title><subtitle type='html'>Adam Ashenfelter's Project Blog</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://ashenfad.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://ashenfad.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Adam</name><uri>http://www.blogger.com/profile/17499355093104727120</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-z7IbKEsJpFk/Th6kWqEWzSI/AAAAAAAAAc4/ePllTfpQKSM/s220/aa.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>9</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7717060942576029301.post-7580008005268543533</id><published>2011-07-23T00:09:00.000-07:00</published><updated>2011-07-23T11:20:26.959-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='genetic algorithm'/><category scheme='http://www.blogger.com/atom/ns#' term='gerrymandering'/><title type='text'>Genetic Gerrymandering - Oregon</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;So I automated gerrymandering. It's not the most noble side-project I've worked on. But it's interesting how a few simple rules can emulate devious human behavior. Read on to find out how. :)&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;Every ten years the US Census Bureau counts up the nation's population. With these numbers congressional seats are reallocated among states and each state redraws the&amp;nbsp;boundaries&amp;nbsp;of their congressional districts. All this is&amp;nbsp;to ensure that citizens have, more or less, equal representation in the House of Representatives.&lt;br /&gt;&lt;br /&gt;Maintaining equal representation is a good thing, but state legislatures sometimes take advantage of it to ensure the party in power stays in power. This process, called&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Gerrymandering"&gt;gerrymandering&lt;/a&gt;, is accomplished through &lt;a href="http://en.wikipedia.org/wiki/Gerrymandering#Packing_and_cracking"&gt;"packing and cracking"&lt;/a&gt; populations so that one party will win a&amp;nbsp;disproportionate number of elections (compared to their actual public support).&lt;br /&gt;&lt;br /&gt;My long term goal is to grade and rank each state's districting objectively (well, objectively according to criteria that I'll make up *shrug*). &amp;nbsp;I decided, for my first step, to try my own hand at gerrymandering. I picked my home state of Oregon as the guinea pig.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a href="http://goo.gl/WKJBr"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-y01aanDzUOk/TiKofU-5nvI/AAAAAAAAAd8/6_dbgoacACM/s1600/tracts.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;The US Census Bureau counts up populations in small geographic areas called &lt;i&gt;census tracts&lt;/i&gt;. These tracts are combined to create congressional districts. Fortunately the tract shapes are published online, so I grabbed the 756 Oregon tracts and converted them into a &lt;a href="http://goo.gl/WKJBr"&gt;KML file&lt;/a&gt;&amp;nbsp;shown above.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: -webkit-auto;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: -webkit-auto;"&gt;With the shape information I built a graph representation of Oregon's tracts. In the graph each node represents a tract and each edge represents a shared border between two tracts. However, not all shared borders qualify for an edge. A shared border must make up a significant portion of each tracts' overall border. The larger the tracts, the longer the required border. If I encounter any island tracts (not a problem for Oregon), I build the &lt;a href="http://en.wikipedia.org/wiki/Minimum_spanning_tree"&gt;minimum spanning tree&lt;/a&gt;.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: -webkit-auto;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: -webkit-auto;"&gt;The image below shows the tract graph&amp;nbsp;overlaid&amp;nbsp;on the tracts for Corvallis, my home town. The dashed line is a shared border that wasn't large enough to make the cut.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-D17ZRidm6xU/TiKDdzsfnsI/AAAAAAAAAds/eBVqjJyES5c/s1600/tractGraph.png" /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;With the tracts represented as a graph, creating a set of districts becomes a graph partitioning problem. The goal, though, is to partition it so that one political party wins the maximum number of districts. That meant I needed two more pieces of information, the population and the political&amp;nbsp;affiliation&amp;nbsp;for each census tract.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;The US Census Bureau kindly makes populations for each tract publicly available. &lt;a href="http://www.blogger.com/profile/13329419403544669398"&gt;Darren&lt;/a&gt; (my geeky partner in crime) cleaned it up and posted it &lt;a href="http://www.infochimps.com/datasets/american-community-survey-oregon-2005-2009-detailed-tables"&gt;here&lt;/a&gt;. Unfortunately I ran my tests before it was ready. Instead I was lazy and made up my own population numbers. It's not too bad, though. The Census Bureau tries to put around 4000-5000 people in each tract. I set each tract's population near the upper part of that range. &amp;nbsp;My version of Oregon ends up with roughly 3.6 million residents. It's a decent, if not perfect approximation.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Getting the political affiliation is tricky. I had a&amp;nbsp;naive&amp;nbsp;optimism that the voting records for each tract might be public and easily&amp;nbsp;accessible. Not so much. Public records exist for voting districts but those voting districts need to be mapped back to tracts. Even worse, the data is in ad hoc formats from county to county. Collecting it could be a full time job.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;So I fudged it and made up my own political affiliations. Around Oregon, urban areas lean more Democratic than rural areas. I ran with that idea and used population density a proxy for political affiliation. I set the most urban tracts to have a 3-1 Democrat to Republican ratio, while the most rural tracts have a 1-3 ratio. That's pretty polarized but if Nate Silver is correct, Oregon may be the &lt;a href="http://www.fivethirtyeight.com/2008/10/road-to-270-oregon.html"&gt;most polarized state in the nation&lt;/a&gt;.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;The image below shows a map of the tracts in the Portland and Salem area shaded by political affiliation. As you'd expect, Democrat friendly tracts are blue, Republican leaning tracts are red, and mixed tracts are shades of purple. Looking at the map of the state, Oregon is mostly red with scattered but dense blue population centers. Overall I gave Oregon a 52%-48% lean for the Democrats, which seemed fair after our most recent&amp;nbsp;gubernatorial&amp;nbsp;race.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-tZkna7ZbUWM/TiKEUiS3sAI/AAAAAAAAAdw/0OmesjS0M9k/s1600/tractAffiliation.png" /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;At this point I had all the information necessary to gerrymander Oregon. I just needed to settle on how. The rules were simple. I had 756 tracts that needed to be arranged into 5 districts (what I call a district set) and each district had to have about the same population. The more districts won by the target party (Dem or Repub) the better. Moreover, the bigger the margin of support in a won district the better.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Since calculating the optimal set of districts is a hard problem, I went for a heuristic search instead. Specifically, I used a &lt;a href="http://en.wikipedia.org/wiki/Genetic_algorithm"&gt;genetic search&lt;/a&gt;. The basic idea is to start with some number of random, but valid, district sets. Take each district set and make some random small changes to the district borders. This creates a "child" district set. After generating some children for each parent district set, choose a few of the best ones. This is the "survival of the fittest" stage. Repeat the whole thing, generation after generation, until we get some solutions we're happy with.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;To do a genetic search, I needed three things. First, I had to be able to generate random but valid district sets. Second, I needed a way to create small (but still valid) changes to a district set for creating children. Lastly, I needed a scoring function that would let me intelligently pick the which district sets should survive.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;I'll skip the details on the first two tasks. That was just a bit hackery. The scoring function is a little more interesting (admittedly I might be the only person on the planet who thinks that). The&amp;nbsp;simplest&amp;nbsp;way to score would be to add up the number of winning districts. There are two problems with that, though. Margin of victory is important. A paltry 51% support for your party is unsafe come election time. Also, the genetic search does better when the scoring function is continuous rather than discrete. This way small changes can give slightly better scores and help guide our search to success.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;To pull this off I used a &lt;a href="http://en.wikipedia.org/wiki/Sigmoid_function"&gt;sigmoid function&lt;/a&gt;. Specifically, I used the &lt;a href="http://en.wikipedia.org/wiki/Cumulative_distribution_function"&gt;cumulative distribution function&lt;/a&gt; of a &lt;a href="http://en.wikipedia.org/wiki/Normal_distribution"&gt;normal curve&lt;/a&gt; with a mean of &lt;i&gt;0.5&lt;/i&gt; and a standard deviation of &lt;i&gt;0.05&lt;/i&gt;. It gives the nice S shaped curve below. The plot&amp;nbsp;illustrates&amp;nbsp;how the score increases as a district's party support increases (from a min of 0.25 to a max of 0.75).&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-uMbEskeVa_E/TiNNUqjZhqI/AAAAAAAAAeM/ZMdrQ7dSp80/s1600/scoring.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-uMbEskeVa_E/TiNNUqjZhqI/AAAAAAAAAeM/ZMdrQ7dSp80/s1600/scoring.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;That simple curve was all the genetic algorithm needed to discover the dubious human inventions of "packing and cracking". I ran two experiments, gerrymandering Oregon once for the Democrats and once for the Republicans. Each experiment took about six minutes on my laptop. That was time enough for the genetic search algorithm produce 100 generations and consider over 20,000 ways to partition the state.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;The image below shows the &lt;a href="http://goo.gl/LbfNl"&gt;best set of districts found for the Democrats&lt;/a&gt;.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://goo.gl/LbfNl"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-XZIuURq2Opo/TiKodfyVJUI/AAAAAAAAAd0/tPD__vw9q-E/s1600/forBlue.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;table align="center" bgcolor="ffffff" border="1" bordercolor="aaaaaa" cellpadding="5"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;District&lt;/div&gt;&lt;/th&gt;&lt;th align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Color&lt;/div&gt;&lt;/th&gt;&lt;th align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Pro Dem&lt;/div&gt;&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;1&lt;/div&gt;&lt;/td&gt;&lt;td align="center" bgcolor="809a86"&gt;&lt;/td&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="color: #3d85c6;"&gt;67.7%&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;2&lt;/div&gt;&lt;/td&gt;&lt;td align="center" bgcolor="c574b8"&gt;&lt;/td&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="color: #3d85c6;"&gt;57.3%&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;3&lt;/div&gt;&lt;/td&gt;&lt;td align="center" bgcolor="67c4d1"&gt;&lt;/td&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="color: #cc0000;"&gt;35.6%&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;4&lt;/div&gt;&lt;/td&gt;&lt;td align="center" bgcolor="d0625d"&gt;&lt;/td&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="color: #3d85c6;"&gt;54.7%&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;5&lt;/div&gt;&lt;/td&gt;&lt;td align="center" bgcolor="577f8b"&gt;&lt;/td&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="color: #3d85c6;"&gt;57.3%&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;In this scenario, the Democrats take four of the five districts. District 3 is the Republican concession district. It covers the vast majority of state's area by swallowing up Oregon's most rural and red friendly tracts. District 2 covers most of the mid-Willamette Valley, turning Dem-friendly thanks to the population centers of Salem, Eugene/Springfield, and Albany/Corvallis. Districts 1, 4, and 5 &lt;a href="http://goo.gl/vn5Lv"&gt;each take a chunk of Portland's Democratic base&lt;/a&gt; to finish the Democratic dominance (see the next image).&amp;nbsp;Those familiar with Oregon might notice that this&amp;nbsp;fictitious set of districts&amp;nbsp;has a quite a&amp;nbsp;resemblance to &lt;a href="http://en.wikipedia.org/wiki/Oregon's_congressional_districts"&gt;Oregon's actual congressional districts&lt;/a&gt;. Coincidence? :P&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://goo.gl/vn5Lv"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-cvcxh-4ZeZg/TiMmWZym7QI/AAAAAAAAAeA/ncrrCPv_PJ8/s1600/Screen+shot+2011-07-17+at+11.05.58+AM.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;In the &lt;a href="http://goo.gl/4dlcQ"&gt;second&amp;nbsp;experiment&lt;/a&gt;, the state is divvied up to be Republican friendly. This time around the Republicans win four&amp;nbsp;districts, leaving only one for the Democrats.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://goo.gl/4dlcQ"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-GSdvBu7uzCI/TiKodSKg8hI/AAAAAAAAAd4/78YQzZHctn8/s1600/forRed.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;table align="center" bgcolor="ffffff" border="1" bordercolor="aaaaaa" cellpadding="5"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;District&lt;/div&gt;&lt;/th&gt;&lt;th align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Color&lt;/div&gt;&lt;/th&gt;&lt;th align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Pro Repub&lt;/div&gt;&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;1&lt;/div&gt;&lt;/td&gt;&lt;td align="center" bgcolor="809a86"&gt;&lt;/td&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="color: #cc0000;"&gt;54.1%&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;2&lt;/div&gt;&lt;/td&gt;&lt;td align="center" bgcolor="c574b8"&gt;&lt;/td&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="color: #cc0000;"&gt;53.9%&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;3&lt;/div&gt;&lt;/td&gt;&lt;td align="center" bgcolor="67c4d1"&gt;&lt;/td&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="color: #cc0000;"&gt;52.1%&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;4&lt;/div&gt;&lt;/td&gt;&lt;td align="center" bgcolor="d0625d"&gt;&lt;/td&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="color: #3d85c6;"&gt;25.9%&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;5&lt;/div&gt;&lt;/td&gt;&lt;td align="center" bgcolor="577f8b"&gt;&lt;/td&gt;&lt;td align="center"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="color: #cc0000;"&gt;54.5%&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;In this scenario, district 4 is the only Democratic holdout. It's not even visible on the previous map - so let's &lt;a href="http://goo.gl/e5HZY"&gt;zoom in&lt;/a&gt;:&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://goo.gl/e5HZY"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-JdInnCTxQHE/TiMmeGrwjQI/AAAAAAAAAeI/GSzvGcVZPM4/s1600/Screen+shot+2011-07-17+at+11.05.32+AM.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;District 4 jams in as many Portland Democrats as possible. This is textbook "packing", leaving the rest of Portland weak enough to divide and conquer by districts 5 and 2.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Not to be left out of the fun, districts 1 and 3 pull off a textbook "cracking" scenario by &lt;a href="http://goo.gl/2H5ZK"&gt;splitting up the Eugene/Springfield metro area&lt;/a&gt;.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://goo.gl/2H5ZK"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-73VLnZjelh0/TiMmdzbWR4I/AAAAAAAAAeE/XkPEK_61eTo/s1600/Screen+shot+2011-07-17+at+11.05.12+AM.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;So there we are - a fairly simple genetic search that can gerrymander nearly as well as us humans. Like I mentioned earlier, I'm hoping to morph this into a way to grade states based on their actual districting. Until then, this is really just a way to automate dirty politics - and I don't think any help is needed.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7717060942576029301-7580008005268543533?l=ashenfad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ashenfad.blogspot.com/feeds/7580008005268543533/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://ashenfad.blogspot.com/2011/07/genetic-gerrymandering-oregon.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/7580008005268543533'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/7580008005268543533'/><link rel='alternate' type='text/html' href='http://ashenfad.blogspot.com/2011/07/genetic-gerrymandering-oregon.html' title='Genetic Gerrymandering - Oregon'/><author><name>Adam</name><uri>http://www.blogger.com/profile/17499355093104727120</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-z7IbKEsJpFk/Th6kWqEWzSI/AAAAAAAAAc4/ePllTfpQKSM/s220/aa.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-y01aanDzUOk/TiKofU-5nvI/AAAAAAAAAd8/6_dbgoacACM/s72-c/tracts.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7717060942576029301.post-8297384635350093220</id><published>2011-06-10T01:28:00.000-07:00</published><updated>2011-06-14T10:26:26.367-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='streaming'/><category scheme='http://www.blogger.com/atom/ns#' term='map reduce'/><category scheme='http://www.blogger.com/atom/ns#' term='clojure'/><category scheme='http://www.blogger.com/atom/ns#' term='sampling with replacement'/><title type='text'>Streaming (one-pass) Sampling With Replacement</title><content type='html'>Recently I've been looking at sampling techniques for very large or streaming datasets. Specifically, I needed an algorithm that could perform &lt;a href="http://www.ma.utexas.edu/users/parker/sampling/repl.htm"&gt;random sampling with replacement&lt;/a&gt; during a single pass over a dataset too large to fit into memory.&lt;br /&gt;&lt;br /&gt;A bit of searching led me a method that accomplishes just that using a &lt;a href="http://www.siam.org/proceedings/datamining/2004/dm04_053parkb.pdf"&gt;dynamic sample reservoir&lt;/a&gt;. However, I wanted the ability to generate very large samples. Just like the original dataset, the sample might not fit into memory. So reservoirs were out.&lt;br /&gt;&lt;br /&gt;Fortunately my problem had a simplification that made things much easier. &amp;nbsp;The reservoir approach assumed that the size of the original dataset was unknown. Not true in my case. Before I start the sample I'll know whether the dataset has a million instances or a billion.&lt;br /&gt;&lt;br /&gt;This piece of information lets me cheat. Given the overall population size (the number of instances in the original dataset) and the intended sample size, I can calculate the&amp;nbsp;occurrence&amp;nbsp;probabilities for a single instance. I can find how likely an instance is to be sampled once, twice, three times, etc.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-ChCz930-y-M/TfHJFGs-UEI/AAAAAAAAARU/RELFKiQ4x_k/s1600/3of5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-ChCz930-y-M/TfHJFGs-UEI/AAAAAAAAARU/RELFKiQ4x_k/s1600/3of5.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Once I have that&amp;nbsp;probability&amp;nbsp;distribution I can iterate over the original dataset. For each instance I just roll the die to determine how many times to sample it.&lt;br /&gt;&lt;br /&gt;To calculate the&amp;nbsp;occurrence&amp;nbsp;probabilities, I use the following equation. &amp;nbsp;Let &lt;i&gt;&lt;b&gt;x&lt;/b&gt;&lt;/i&gt; be the&amp;nbsp;occurrences, &lt;i&gt;&lt;b&gt;l&lt;/b&gt;&lt;/i&gt; be the sample size, and &lt;i&gt;&lt;b&gt;n&lt;/b&gt;&lt;/i&gt; the original population size:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://mathurl.com/6ku88e8"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-HZIVe29JVzw/TfHJ0o0Ba6I/AAAAAAAAARY/YTB_iSZFKCs/s1600/occuranceProb.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;In essence, this equation is taking the probability of an instance being selected&amp;nbsp;&lt;b&gt;&lt;i&gt;x&lt;/i&gt;&lt;/b&gt; times, not selected&amp;nbsp;&lt;i&gt;&lt;b&gt;(l-x)&lt;/b&gt;&lt;/i&gt;&amp;nbsp;times, and multiplying it all by total number of&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Combination"&gt;ways this could happen&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;To build a full distribution I calculate the probability of each occurrence starting at 0 and going up until I've captured nearly all the probability mass. This seems to works quite nicely, even for wacky situations like over-sampling the original population.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-nxehfcUCS7w/TfHPO29uhpI/AAAAAAAAARc/N-SzhO-r0wo/s1600/500of100.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-nxehfcUCS7w/TfHPO29uhpI/AAAAAAAAARc/N-SzhO-r0wo/s1600/500of100.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;These are the strong points of this method:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Single pass sampling with replacement for any size dataset without any significant memory requirements&lt;/li&gt;&lt;li&gt;The original dataset can be split into chunks, sampled separately, and then recombined (perfect for map-reduce)&lt;/li&gt;&lt;li&gt;More than one sample set could be built in the same pass over the original data&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;But with these caveats:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Must know the size of the original dataset&lt;/li&gt;&lt;li&gt;The final size of the sample set will be near the intended size - but will be a little bigger or smaller depending on lady luck&lt;/li&gt;&lt;li&gt;The order of the sampled data won't be randomized&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;And finally, my Clojure implementation of this technique:&lt;br /&gt;&lt;br /&gt;&lt;script src="https://gist.github.com/979783.js?file=sample-replacement.clj"&gt;&lt;/script&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7717060942576029301-8297384635350093220?l=ashenfad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ashenfad.blogspot.com/feeds/8297384635350093220/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://ashenfad.blogspot.com/2011/06/single-pass-sampling-with-replacement.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/8297384635350093220'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/8297384635350093220'/><link rel='alternate' type='text/html' href='http://ashenfad.blogspot.com/2011/06/single-pass-sampling-with-replacement.html' title='Streaming (one-pass) Sampling With Replacement'/><author><name>Adam</name><uri>http://www.blogger.com/profile/17499355093104727120</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-z7IbKEsJpFk/Th6kWqEWzSI/AAAAAAAAAc4/ePllTfpQKSM/s220/aa.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-ChCz930-y-M/TfHJFGs-UEI/AAAAAAAAARU/RELFKiQ4x_k/s72-c/3of5.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7717060942576029301.post-7154415331458289272</id><published>2011-05-13T11:02:00.000-07:00</published><updated>2011-05-13T11:02:31.458-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='clojure'/><category scheme='http://www.blogger.com/atom/ns#' term='gists'/><title type='text'>Another Clojure snippet</title><content type='html'>In my continuing adventure learning Clojure, I've put together a bit of code that takes a collection of strings and filters any that are contained in another string.&lt;br /&gt;&lt;br /&gt;To whoever is interested:&lt;br /&gt;&lt;br /&gt;&lt;script src="https://gist.github.com/970988.js?file=unique-tokens.clj"&gt;&lt;/script&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7717060942576029301-7154415331458289272?l=ashenfad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ashenfad.blogspot.com/feeds/7154415331458289272/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://ashenfad.blogspot.com/2011/05/another-clojure-snippet.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/7154415331458289272'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/7154415331458289272'/><link rel='alternate' type='text/html' href='http://ashenfad.blogspot.com/2011/05/another-clojure-snippet.html' title='Another Clojure snippet'/><author><name>Adam</name><uri>http://www.blogger.com/profile/17499355093104727120</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-z7IbKEsJpFk/Th6kWqEWzSI/AAAAAAAAAc4/ePllTfpQKSM/s220/aa.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7717060942576029301.post-7067893257248925551</id><published>2011-05-07T18:29:00.000-07:00</published><updated>2011-05-13T11:03:06.003-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='risk'/><category scheme='http://www.blogger.com/atom/ns#' term='monte carlo'/><category scheme='http://www.blogger.com/atom/ns#' term='clojure'/><category scheme='http://www.blogger.com/atom/ns#' term='gists'/><title type='text'>Risk Roll'n (with Clojure)</title><content type='html'>Once upon a time my frequent and unlucky defeats while playing &lt;a href="http://en.wikipedia.org/wiki/Risk_(game)"&gt;Risk&lt;/a&gt; led me to build super simple &lt;a href="http://en.wikipedia.org/wiki/Monte_Carlo_method"&gt;Monte Carlo-style&lt;/a&gt; simulator for Risk battles. It didn't help me win any more often, but at least I could complain about my losses by showing just how unlikely they were. &amp;nbsp;It was a consolation prize of sorts.&lt;br /&gt;&lt;br /&gt;Recently I've been learning &lt;a href="http://clojure.org/"&gt;Clojure&lt;/a&gt;&amp;nbsp;with the intent of trying out&amp;nbsp;&lt;a href="https://github.com/nathanmarz/cascalog"&gt;Cascalog&lt;/a&gt;&amp;nbsp;for&amp;nbsp;simplifying&amp;nbsp;&lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; workflows. It's been ages since I've programmed in a functional language, so&amp;nbsp;I wanted a bite sized project to help get comfy with Clojure. The Risk battle simulator seemed like a good fit - so here it is rebuilt with Clojure-y goodness.&lt;br /&gt;&lt;br /&gt;&lt;script src="https://gist.github.com/959518.js?file=risk-roll.clj"&gt;&lt;/script&gt;&lt;br /&gt;Like I said, I'm new to Clojure, so don't assume any of that is idiomatic. Nonetheless, the code is pretty concise compared to my Java implementation and the slick &lt;a href="http://incanter.org/"&gt;Incanter&lt;/a&gt; library made it easy to chart the distribution of outcomes for a battle.&lt;br /&gt;&lt;br /&gt;Running "(view-outcomes {:attackers 10 :defenders 10} 1000000)" will compute and chart the outcomes of a million 10 vs. 10 risk battles. The magnitude on the outcome axis indicates how many armies are left, the&amp;nbsp;negative&amp;nbsp;range means the&amp;nbsp;defender&amp;nbsp;wins and positive means the attacker wins.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-WDGKVryPFWk/TcXssFKkdxI/AAAAAAAAAQ0/hmN07glf4H8/s1600/10v10_1000000.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-WDGKVryPFWk/TcXssFKkdxI/AAAAAAAAAQ0/hmN07glf4H8/s1600/10v10_1000000.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;So the next time your 20 armies are destroyed by 12 measly defenders, you can point at this chart and complain with accuracy:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-BDX6p-FWO0M/TcXxab99NTI/AAAAAAAAAQ4/4lhMRq44xpM/s1600/20v12_100000.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-BDX6p-FWO0M/TcXxab99NTI/AAAAAAAAAQ4/4lhMRq44xpM/s1600/20v12_100000.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7717060942576029301-7067893257248925551?l=ashenfad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ashenfad.blogspot.com/feeds/7067893257248925551/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://ashenfad.blogspot.com/2011/05/risk-rolln-with-clojure.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/7067893257248925551'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/7067893257248925551'/><link rel='alternate' type='text/html' href='http://ashenfad.blogspot.com/2011/05/risk-rolln-with-clojure.html' title='Risk Roll&apos;n (with Clojure)'/><author><name>Adam</name><uri>http://www.blogger.com/profile/17499355093104727120</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-z7IbKEsJpFk/Th6kWqEWzSI/AAAAAAAAAc4/ePllTfpQKSM/s220/aa.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-WDGKVryPFWk/TcXssFKkdxI/AAAAAAAAAQ0/hmN07glf4H8/s72-c/10v10_1000000.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7717060942576029301.post-6058161350351612043</id><published>2011-03-05T22:41:00.000-08:00</published><updated>2011-06-14T10:35:22.252-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='map reduce'/><category scheme='http://www.blogger.com/atom/ns#' term='supervised learning'/><category scheme='http://www.blogger.com/atom/ns#' term='classification tree'/><category scheme='http://www.blogger.com/atom/ns#' term='c4.5'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Decision Trees &amp; Hadoop - Part 3: Results</title><content type='html'>&lt;a href="http://ashenfad.blogspot.com/2010/12/decision-trees-in-hadoop-part-1-data.html"&gt;Part 1: Data&lt;/a&gt; | &lt;a href="http://ashenfad.blogspot.com/2010/12/decision-trees-hadoop-part-2-approach.html"&gt;Part 2: Approach&lt;/a&gt; | Part 3: Results&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;And now the (exciting?) conclusion of my adventures building a C4.5-like decision tree with Hadoop.&amp;nbsp;Since my previous post I've implemented the design, added a few new wrinkles, and collected the results using my Mandelbrot toy problem.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;I'm skipping the low level details, but the "new wrinkles" are two additional map-reduce tasks.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;One of the tasks triggers whenever the number of instances in a tree node drops below a predefined threshold. The threshold is set low enough that we know the instances can fit into the processes' memory. So we go ahead and grow the rest of that tree branch in the standard recursive manner, greatly reducing the time it takes to build the entire tree.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The second task simply searches for instances that belong to leaves of the tree (nodes we no longer consider for splitting) and removes them. This reduces the amount of data we need to evaluate in future iterations.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;I ran the Hadoop tree algorithm on the toy dataset I described in &lt;a href="http://ashenfad.blogspot.com/2010/12/decision-trees-in-hadoop-part-1-data.html"&gt;part 1&lt;/a&gt;&amp;nbsp;and compared it to the RepTree and RandomForest techniques from &lt;a href="http://www.cs.waikato.ac.nz/ml/weka/"&gt;Weka's&lt;/a&gt; collection. RepTree and the RandomForest outperformed HadoopTree for 10K, 100K, and 1M datasets, but failed to build on the 10M dataset (using a 2 GB JVM). The HadoopTree trained on the 10M dataset had the best overall accuracy.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;To give the HadoopTree fair competition I added 11&amp;nbsp;bagged RepTrees (&lt;a href="http://en.wikipedia.org/wiki/Bootstrap_aggregating"&gt;bagging&lt;/a&gt; being a more traditional way to tackle giant datasets). The bagged RepTree's performance was nearly an exact match of the HadoopTree's results.&lt;/div&gt;&lt;br /&gt;&lt;script src="//ajax.googleapis.com/ajax/static/modules/gviz/1.0/chart.js" type="text/javascript"&gt; {"chartType":"ColumnChart","chartName":"MandelbrotPrediction","dataSourceUrl":"//spreadsheets.google.com/tq?key=0Ah2oAcudnjP4dElFRFgxVk1EN2tmX3prVWppZFdUTFE&amp;range=A1%3AE5&amp;gid=4&amp;transpose=0&amp;headers=1&amp;pub=1","options":{"displayAnnotations":true,"showTip":true,"reverseCategories":false,"titleY":"Accuracy","dataMode":"markers","titleX":"Training Set Size","maxAlternation":1,"pointSize":"0","colors":["#3366CC","#DC3912","#FF9900","#109618","#990099","#0099C6","#DD4477","#66AA00","#B82E2E","#316395"],"smoothLine":false,"lineWidth":"2","labelPosition":"right","is3D":false,"logScale":false,"wmode":"opaque","hasLabelsColumn":true,"title":"Mandelbrot Set Prediction","legend":"right","allowCollapse":true,"cht":"bhg","reverseAxis":false,"isStacked":false,"mapType":"hybrid","width":655,"height":400},"packages":"corechart","refreshInterval":5} &lt;/script&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;In the second round of tests I used Mandelbrot data with 50 dummy numeric features. This made the dataset much larger and RepTree and RandomForest failed to build on both the 1M and 10M datasets. Random splits don't do well with a large number of junk features so, not surprisingly, the RandomForest's performance suffered on this dataset.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The HadoopTree trained on 10M instances carries the day, but a bag of 101 RepTrees comes in a close second. Considering that bagged smaller trees are much faster to grow and would do better on noisy data, the single HadoopTree is more novelty than practical. Although I'm not implementing it at the moment, I expect this technique would provide a bigger payoff if it were modified to grow boosted regression trees.&lt;/div&gt;&lt;br /&gt;&lt;script src="//ajax.googleapis.com/ajax/static/modules/gviz/1.0/chart.js" type="text/javascript"&gt; {"chartType":"ColumnChart","chartName":"Mandelbrot50Prediction","dataSourceUrl":"//spreadsheets.google.com/tq?key=0Ah2oAcudnjP4dElFRFgxVk1EN2tmX3prVWppZFdUTFE&amp;range=A1%3AE5&amp;gid=3&amp;transpose=0&amp;headers=1&amp;pub=1","options":{"displayAnnotations":true,"showTip":true,"reverseCategories":false,"titleY":"Accuracy","dataMode":"markers","titleX":"Training Set Size","maxAlternation":1,"pointSize":"0","colors":["#3366CC","#DC3912","#FF9900","#109618","#990099","#0099C6","#DD4477","#66AA00","#B82E2E","#316395"],"smoothLine":false,"lineWidth":"2","labelPosition":"right","is3D":false,"logScale":false,"wmode":"opaque","hasLabelsColumn":true,"title":"Mandelbrot Set Prediction w/ 50 Dummy Features","legend":"right","allowCollapse":true,"reverseAxis":false,"mapType":"hybrid","isStacked":false,"width":655,"height":400},"packages":"corechart","refreshInterval":5} &lt;/script&gt;&lt;br /&gt;&lt;br /&gt;See the full results &lt;a href="https://spreadsheets0.google.com/pub?hl=en&amp;amp;hl=en&amp;amp;key=0Ah2oAcudnjP4dElFRFgxVk1EN2tmX3prVWppZFdUTFE&amp;amp;output=html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;The code is viewable &lt;a href="https://github.com/ashenfad/hadooptree"&gt;on GitHub&lt;/a&gt;, but be warned, it's still a toy project. I haven't serialized the output from the map-reduce tasks (so there's a lot more data transfer than there should be) or made a proper parameter/config file.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7717060942576029301-6058161350351612043?l=ashenfad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ashenfad.blogspot.com/feeds/6058161350351612043/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://ashenfad.blogspot.com/2011/03/decision-trees-hadoop-part-3-results.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/6058161350351612043'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/6058161350351612043'/><link rel='alternate' type='text/html' href='http://ashenfad.blogspot.com/2011/03/decision-trees-hadoop-part-3-results.html' title='Decision Trees &amp; Hadoop - Part 3: Results'/><author><name>Adam</name><uri>http://www.blogger.com/profile/17499355093104727120</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-z7IbKEsJpFk/Th6kWqEWzSI/AAAAAAAAAc4/ePllTfpQKSM/s220/aa.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7717060942576029301.post-6289859735831436483</id><published>2011-01-18T00:00:00.000-08:00</published><updated>2011-01-18T00:21:56.429-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='cloudera'/><category scheme='http://www.blogger.com/atom/ns#' term='maverick'/><category scheme='http://www.blogger.com/atom/ns#' term='map reduce'/><category scheme='http://www.blogger.com/atom/ns#' term='whirr'/><category scheme='http://www.blogger.com/atom/ns#' term='ec2'/><category scheme='http://www.blogger.com/atom/ns#' term='ubuntu'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Hadoop cluster on EC2 using Cloudera distribution of Whirr</title><content type='html'>&lt;span style="font-size: large;"&gt;&lt;b&gt;Motivation &lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There are a number of different tutorials that have been posted to get &lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; up on an &lt;a href="http://aws.amazon.com/ec2/"&gt;EC2&lt;/a&gt; cluster and then run Hadoop jobs on this cluster from a remote machine.&amp;nbsp; I ended up using &lt;a href="http://archive.cloudera.com/cdh/3/whirr/"&gt;Whirr&lt;/a&gt; from &lt;a href="http://www.cloudera.com/"&gt;Cloudera&lt;/a&gt;&amp;nbsp;CDH3 and have been through a number of websites and discussion groups.&amp;nbsp; But thus far I have not found a way to get everything up without a few headaches.&amp;nbsp; I thought it might be useful to post what worked for me and warn of some pitfalls along the way.&amp;nbsp; These instructions or for a local machine running Ubuntu 10.10.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;Terminology&lt;/b&gt;&lt;/span&gt; &lt;br /&gt;&lt;br /&gt;Most of the install will be done on your local machine with a bit of testing on the name node of the Hadoop cluster running on EC2.&amp;nbsp; Shell commands executed on your local machine will start with a dollar sign whereas shell commands executed on the remote name node will begin with a hash.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;commands run on the local machine&lt;/b&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;$ command&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;b&gt;commands run on name node of the cluster&lt;/b&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;# command&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;b&gt;&lt;span style="font-size: large;"&gt;Pitfalls&lt;/span&gt;&lt;/b&gt; &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Debian Packages are not worth it&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The Debian packages for Cloudera, including Hadoop and Whirr were built for Ubuntu 10.4.&amp;nbsp; Since I wanted to work with the latest release of Ubuntu, which at the time of this writing is 10.10,&amp;nbsp; the Debian packages for Cloudera were more difficult to install than just grabbing the tarball.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Hadoop versions on local machine and cluster must match &lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Hadoop must be the same version for the cluster and the local machine.&amp;nbsp; The default Whirr instance at the time of writing has Hadoop 0.20.2+737.&amp;nbsp; Therefore the 0.20.737 tarball must be used to run Hadoop jobs provided Cloudera AMI-based cluster.&amp;nbsp; I am currently using &lt;a href="http://archive.cloudera.com/cdh/3/hadoop-0.20.2+737.tar.gz"&gt;Cloudera CDH3 Hadoop 0.20.2+737.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Depricated Hadoop configuration scheme&lt;/b&gt; &lt;br /&gt;&lt;br /&gt;Whirr uses a deprecated Hadoop configuration scheme.&amp;nbsp; It has not been an issue yet but it may be something to watch out for.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;Installation Instructions &lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Install Open JDK&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Install the package from the apt repositories&lt;br /&gt;&lt;br /&gt;&lt;code&gt;$ sudo apt-get install openjdk-6-jre&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Set the $JAVA_HOME environment variable (see &lt;a href="http://numberformat.wordpress.com/2010/01/24/setting-environment-variables-in-ubuntu/"&gt;elsewhere&lt;/a&gt; for instructions on permanently setting environment variables)&lt;br /&gt;&lt;br /&gt;&lt;code&gt;$ export JAVA_HOME=/usr/lib/jvm/java-6-openjdk&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Set up SSH keys&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Generate a keypair to connect to the cluster&lt;br /&gt;&lt;br /&gt;&lt;code&gt;$ ssh-keygen -t rsa&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Get Hadoop tarball from Cloudera&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Download the tarball here: &lt;a href="http://archive.cloudera.com/cdh/3/hadoop-0.20.2+737.tar.gz"&gt;Cloudera CDH3 Hadoop 0.20.2+737.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Install Hadoop locally&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;$ mkdir cdh3&lt;/code&gt;&lt;br /&gt;&lt;code&gt;$ cd cdh3&lt;/code&gt;&lt;br /&gt;&lt;code&gt;$ tar -xzvf ../Downloads/hadoop-0.20.2+737.tar.gz&lt;/code&gt;&lt;br /&gt;&lt;code&gt;$ ln -s hadoop-0.20.2+737/ hadoop&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Feel free to add Hadoop&lt;b&gt; &lt;/b&gt;(and later Whirr) to your path so you don't have to specify the whole path with Whirr and Hadoop in future sessions.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Test hadoop locally&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;You should get a listing of your root file system when you run this command&lt;br /&gt;&lt;br /&gt;&lt;code&gt;$ hadoop/bin/hadoop fs -ls /&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Get Whirr tarball from Cloudera&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Download the tarball here: &lt;a href="http://archive.cloudera.com/cdh/3/whirr-0.1.0+23.tar.gz"&gt;Cloudera Whirr 0.1.0+23.&lt;/a&gt;&lt;br /&gt;&lt;b&gt;Install whirr locally.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;$ tar -xzvf ../Downloads/whirr-0.1.0+23.tar.gz&lt;/code&gt;&lt;br /&gt;&lt;code&gt;$ ln -s whirr-0.1.0+23/ whirr&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Configure Whirr&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Edit ~/cdh3/whirr-config/whirr.cfg&lt;br /&gt;&lt;br /&gt;&lt;code&gt;whirr.service-name=hadoop&lt;br /&gt;whirr.cluster-name=hadoopcluster&lt;br /&gt;whirr.instance-templates=1 jt+nn,1 dn+tt&lt;br /&gt;whirr.provider=ec2&lt;br /&gt;whirr.identity=&amp;lt;Your AWS ACCESS KEY ID&amp;gt;&lt;br /&gt;whirr.credential=&amp;lt;Your AWS SECRET ACCESS KEY&amp;gt;&lt;br /&gt;whirr.private-key-file=${sys:user.home}/.ssh/id_rsa&lt;br /&gt;whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub&lt;br /&gt;whirr.hadoop-install-runurl=cloudera/cdh/install&lt;br /&gt;whirr.hadoop-configure-runurl=cloudera/cdh/post-configure&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This configuration sets up a cluster called hadoopcluster with one job tracker and name node and one data node and task tracker, using your AWS credentials and the rsa keys generated earlier.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Bring up an EC2 cluster using Whirr&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;$ whirr/bin/whirr launch-cluster --config whirr-config/whirr.cfg&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This will take several minutes to complete.&amp;nbsp; When the cluster is up, connection info will be written out both to the screen and to a file for later use.&amp;nbsp;&amp;nbsp; &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Start the Proxy to the cluster&lt;/b&gt; &lt;br /&gt;&lt;br /&gt;&lt;code&gt;$~/.whirr/hadoopcluster/hadoop-proxy.sh&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;First test: can you SSH to the name node?&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;If Whirr is able to bring up a cluster successfully, it will print out the public address of the name node.&amp;nbsp; Use this address and the ssh key you generated to connect to the name node via ssh with a command similar to the following but with the provided IP address in the URI:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;ssh -i ~/.ssh/id_rsa ec2-user@ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Second test: can you invoke Hadoop in an SSH session on the name node?&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;If you can ssh into the cluster, can you execute a hadoop command locally on the cluster?&amp;nbsp; &lt;br /&gt;&lt;br /&gt;&lt;code&gt;# hadoop fs -ls /&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;(This is the only command that will actually be executed in the name  node from within the cluster.&amp;nbsp; The rest of the tutorial will be  performed back on your local machine)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Configure Hadoop to use the cluster&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;$ export HADOOP_CONF_DIR=/$HOME/.whirr/hadoopcluster/&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Start the proxy in another shell (and leave it open)&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;$ sh ~/.whirr/hadoopcluster_dsb/hadoop-proxy.sh&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Minimally test hadoop on cluster from your local machine&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;$ hadoop/bin/hadoop fs -ls /&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This is the exact command that you used to test whether Hadoop was working on your local machine.&amp;nbsp; However, Whirr defined a configuration for Hadoop which specifies the EC2 cluster and in updating the $HADOOP_CONF_DIR environment variable we are pointing Hadoop to this configuration.&lt;br /&gt;&lt;br /&gt;And this is it!&amp;nbsp; You now can run Hadoop jobs originating from your local machine on an a Hadoop EC2 cluster.&amp;nbsp; You also have what you need to SSH or SCP to the name node in case you would like to work on the cluster directly or upload some data outside of Hadoop.&amp;nbsp; Whirr can also, list the nodes that you have brought up and bring down the clusters that you have brought up.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;List the nodes in a cluster&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;$ whirr/bin/whirr list-cluster --config ~/cdh3/whirr-config/whirr.cfg&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Bring down the cluster&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Delete the cluster (and ec2 security roles)&lt;br /&gt;&lt;br /&gt;&lt;code&gt;$ whirr/bin/whirr destroy-cluster --config ~/cdh3/whirr-config/whirr.cfg&lt;/code&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7717060942576029301-6289859735831436483?l=ashenfad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ashenfad.blogspot.com/feeds/6289859735831436483/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://ashenfad.blogspot.com/2011/01/hadoop-cluster-on-ec2-using-cloudera.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/6289859735831436483'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/6289859735831436483'/><link rel='alternate' type='text/html' href='http://ashenfad.blogspot.com/2011/01/hadoop-cluster-on-ec2-using-cloudera.html' title='Hadoop cluster on EC2 using Cloudera distribution of Whirr'/><author><name>Darren</name><uri>http://www.blogger.com/profile/13329419403544669398</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/_1rhNonF73gM/SO5kZJ-w7vI/AAAAAAAAAOI/MlcJb6WT_MU/s1600-R/n571224255_279928_2818.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7717060942576029301.post-9201348015429514782</id><published>2010-12-23T00:06:00.000-08:00</published><updated>2011-05-07T20:00:36.119-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='map reduce'/><category scheme='http://www.blogger.com/atom/ns#' term='supervised learning'/><category scheme='http://www.blogger.com/atom/ns#' term='classification tree'/><category scheme='http://www.blogger.com/atom/ns#' term='c4.5'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Decision Trees &amp; Hadoop - Part 2: Approach</title><content type='html'>&lt;a href="http://ashenfad.blogspot.com/2010/12/decision-trees-in-hadoop-part-1-data.html"&gt;Part 1: Data&lt;/a&gt; | Part 2: Approach | &lt;a href="http://ashenfad.blogspot.com/2011/03/decision-trees-hadoop-part-3-results.html"&gt;Part 3: Results&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;If you're not already familiar with the basics of MapReduce, you may want to check out this &lt;a href="http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html"&gt;tutorial&lt;/a&gt; before reading further.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The box below shows my shorthand for the four MapReduce tasks that make up the core of the training algorithm. I use parenthesis to wrap key-value pairs. The brackets denote a list.&amp;nbsp; And I use 'X' when I don't care about the key.&lt;br /&gt;&lt;br /&gt;The main departure from the traditional classification tree training process is that we build the tree breadth first instead of depth first. Each iteration of tasks 2 and 3 will grow one layer of the tree.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="background-color: #fdfdfd; border: 1px solid rgb(204, 204, 204); padding: 10px;"&gt;&lt;div style="background-color: transparent; margin: 0px;"&gt;&lt;span id="internal-source-marker_0.49436280061490834" style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Task 1: &amp;nbsp;Initialize the Fields and Tree&lt;/span&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Map Input: (X, exampleInstance)&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Map Output: (field, fieldValue)&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Reduce: (field, fieldDescription)&lt;/span&gt;&lt;/div&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Create initial tree using the FieldDescriptions&lt;/span&gt;&lt;/div&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Task 2: &amp;nbsp;Find Best Split for Each Node-Field Pair&lt;/span&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Map Input: (X, exampleInstance)&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Map Output: ([leaf, field], [fieldValue, classValue])&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Reduce: (X, [leaf, field, fieldSplit,&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 144pt; margin-top: 0pt; text-indent: 36pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;[informationGain, classCounts],&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 144pt; margin-top: 0pt; text-indent: 36pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;[informationGain, classCounts]])&lt;/span&gt;&lt;/div&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Task 3: &amp;nbsp;Find Best Split for Each Node&lt;/span&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Map Input: (X, [leaf, field, fieldSplit,&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 144pt; margin-top: 0pt; text-indent: 36pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;[informationGain, classCounts],&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 144pt; margin-top: 0pt; text-indent: 36pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;[informationGain, classCounts]])&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Map Output: (leaf, [field, fieldSplit,&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 144pt; margin-top: 0pt; text-indent: 36pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;[informationGain, classCounts],&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 144pt; margin-top: 0pt; text-indent: 36pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;[informationGain, classCounts]])&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&lt;span style="background-color: transparent; color: black; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Reduce: (X, [&lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;leaf, &lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;field, fieldSplit,&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 144pt; margin-top: 0pt; text-indent: 36pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;[informationGain, classCounts],&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 144pt; margin-top: 0pt; text-indent: 36pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;[informationGain, classCounts]])&lt;/span&gt;&lt;/div&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;/span&gt;Grow Tree! &amp;nbsp;Mark any leaves that are finished growing as final.&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Repeat tasks 2 and 3 until tree is fully grown. &amp;nbsp;Optionally include task 4 in the loop.&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Task 4: &amp;nbsp;Prune Instances (remove example instances from final leaves)&lt;/span&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Map Input: (X, exampleInstance)&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Map Output: (X, exampleInstance)&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt; &amp;nbsp;- Only output the example if it doesn’t evaluate to a leaf node&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"&gt;&lt;span style="background-color: transparent; color: black; font-family: 'Courier New',Courier,monospace; font-size: 11pt; font-style: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Reduce: Identity&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Task 1: Initialize Fields and Tree&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;The first task scans over the input dataset (for now only CSV is supported) to determine whether each field is categorical or numeric. For each categorical field it builds a frequency count over the possible categories. For numeric fields it finds the minimum and maximum values.&lt;br /&gt;&lt;br /&gt;The map function for the first task will iterate over the fields of an example instance and output the field id and its value as the key-value pair.&lt;br /&gt;&lt;br /&gt;For the Mandelbrot data introduced in the&amp;nbsp;&lt;a href="http://ashenfad.blogspot.com/2010/12/decision-trees-in-hadoop-part-1-data.html"&gt;last post&lt;/a&gt;&amp;nbsp;there&amp;nbsp;are three fields.&amp;nbsp; The first two are numeric and the third is categorical; "true" or "false". Below is the map output for a few example points.&lt;br /&gt;&lt;br /&gt;&lt;div style="background-color: #fdfdfd; border: 1px solid rgb(204, 204, 204); padding: 10px;"&gt;&lt;div style="text-align: left;"&gt;Map Input -&amp;gt; Map Output&lt;br /&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(X, [0.322, 0.093, false])&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt;&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(0, 0.322)&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;, (1, 0.093)&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;, (2, false)&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(X, [&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;0.301, 0.085&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;, false])&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt;&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(0, 0.301),&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt; (1, 0.085), &lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(2, false)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(X, [&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;0.377, 0.086, true&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;])&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt;&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(0, 0.377),&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt; (1, 0.086), &lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(2, true)&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(X, [&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;0.321, 0.089, false&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;])&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt;&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(0, 0.321),&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;(1, 0.089),&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(2, false)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;Since we used the field id as a key, the reduce function iterates over all the values for a field. For now, I assume that categorical fields will never have values parsable as a number (things like zipcode would break this assumption). It's restrictive but it makes it easy to discriminate between data types. I'll mention a better way to determine type in a later post.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="background-color: #fdfdfd; border: 1px solid rgb(204, 204, 204); padding: 10px;"&gt;&lt;div style="text-align: left;"&gt;Reduce Input -&amp;gt; Reduce Output&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;(0, [0.322, 0.301, 0.377, 0.321])&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt; [numeric, min:0.301, max:0.377]&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;(1, [0.093, 0.085, 0.086, 0.089])&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt; [numeric, min:0.085, max:0.093]&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;(2, [false, false, true, false])&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt; [categorical, false:3, true:1]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;We now have field descriptions for our training dataset. &amp;nbsp;Also, we require the index of the objective class field as a parameter when running this algorithm. &amp;nbsp;With the class field and the field descriptions, we have enough information to create a classification tree. &amp;nbsp;At this point, it's nothing more than a root node. &amp;nbsp;But it's a start.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;I have a home-cooked XML format for classification trees. Ideally I'd use something like &lt;a href="http://www.dmg.org/v4-0-1/GeneralStructure.html"&gt;PMML&lt;/a&gt;, but for now that's still on the to-do list. Anyway, here is the simple tree for our four examples after task 1 completes:&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="background-color: #fdfdfd; border: 1px solid rgb(204, 204, 204); padding: 10px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;lt;tree objectiveFieldIndex="2"&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;lt;fields&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;lt;field index="0" isCategorical="false" min="0.301" max="0.377"/&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;lt;field index="1" isCategorical="false" min="0.085" max="0.093"/&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;lt;field index="2" isCategorical="true"&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;lt;category value="false" count="3"/&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;lt;category value="true" count="1"/&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;lt;/field&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;lt;/fields&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;lt;root id="0" isLeaf="false"&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;lt;classCounts&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;lt;classCount classCategory="false" count="3"/&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;lt;classCount classCategory="true" count="1"/&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;lt;/classCounts&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;lt;/root&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;lt;/tree&amp;gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Task 2: Find Best Split for Each Node-Field Pair&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;This MapReduce task is core for the entire process. It does the majority of the work to find out which splitting conditions we should use to further grow the tree. But before it starts we need to use one of Hadoop's essential features, the &lt;a href="http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache"&gt;distributed cache&lt;/a&gt;. We use the distributed cache to share our tree's XML file with the machines that will be processing task 2.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Each map node will load our tree into memory before it starts working. Like task 1, every training example from our dataset will be mapped. The map method evaluates the training example on the classification tree to find which leaf node the example 'falls' into. &amp;nbsp;For each field (excepting the class field), the map method will output a key composed of the id of the leaf node and the field id. &amp;nbsp;The output value consists of the field's value and the class field's value.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;For our little example, the tree's root node has a node id of 0. &amp;nbsp;As you'd expect, all four of the training instances evaluate to the root node since it has no children.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="background-color: #fdfdfd; border: 1px solid rgb(204, 204, 204); padding: 10px;"&gt;&lt;div style="text-align: left;"&gt;&lt;div style="margin: 0px;"&gt;Map Input -&amp;gt; Map Output&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(X, [0.322, 0.093, false])&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt;&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;([0, 0], [0.322, false])&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;,&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;([0, 1], [0.093, false])&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(X, [&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;0.301, 0.085&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;, false])&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt;&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;([0, 0], [0.301, false]),&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;([0, 1], [0.085, false])&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(X, [&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;0.377, 0.086, true&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;])&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt;&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;([0, 0], [0.377, true]),&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;([0, 1], [0.086, true])&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px; text-align: left;"&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;(X, [&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;0.321, 0.089, false&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;])&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt;&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;([0, 0], [0.321, false]),&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;([0, 1], [0.089, false])&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;div style="text-align: justify;"&gt;The reduce step will iterate over all the values for each node-field pair. During this task&amp;nbsp;we're looking to find the best split possible given a field and a tree leaf node. To do this, during our iteration we maintain class category frequency counts for all candidate splits. After we've gathered the class frequency counts we can compute which split has the highest &lt;a href="http://en.wikipedia.org/wiki/Information_gain"&gt;information gain&lt;/a&gt;.&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;div style="text-align: justify;"&gt;To maintain the counts for each split we must first determine the set of possible splits. To keep it simple, we only consider binary splits. For a categorical field there will be a candidate split for each possible category (field == someCategory).&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;For a numeric fields the set of candidate splits are a little more tricky (field &amp;lt;= someNumber).&amp;nbsp;The original C4.5 formulation suggested testing a split between each adjacent pair of field values. This, however, doesn't scale so nicely to 'big data' problems. If we're running on a dataset with 10 million points we could end up testing 10 million possible splits for each numeric field. No good.&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;div style="text-align: justify;"&gt;So for this algorithm, we instead use a fixed number of split candidates. I normally use 10000 split candidates, but any value should work as long as there's memory enough to maintain that many separate class frequency counts. We evenly divide a numeric field's range into 'splitCandidates+1' buckets. The possible split values are the thresholds between the buckets. Now while the reducer iterates, we just maintain the class frequency count for each bucket and we have the information needed to find the best split point.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;For our little example, I used 2 split candidates for numeric fields. That means we'll divide a numeric field's range into 3 buckets. &amp;nbsp;Field 0 has a range of 0.301 to 0.377. So each bucket will be 0.025 wide and our split candidate points will be at 0.326 and 0.351. The reducer can now count up class frequencies and find the best split point for the feature. The final output for the reducing step will look something like:&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="background-color: #fdfdfd; border: 1px solid rgb(204, 204, 204); padding: 10px;"&gt;&lt;div style="text-align: left;"&gt;&lt;div style="margin: 0px;"&gt;Reduce Input -&amp;gt; Reduce Output&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;([0, 0], [&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;[0.322, false]&lt;/span&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;,&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;[0.301, false],&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;[0.377, true],&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;[0.321, false]])&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt;&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;(X, [0, 0,&amp;nbsp;&amp;lt;= 0.326, 0.811, [false:3, true:0], [false:0, true:1]])&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;([0, 1], [[0.093, false],&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;[0.085, false]&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;,&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;[0.086, true],&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;[0.089, false]])&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt;&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;(X, [0, 1, &amp;lt;= 0.0877, 0.311, [false:1, true:1], [false:2, true:0]])&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin: 0px;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Task 3: Find Best Split for Each Node&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;For each leaf node, task 2 will produce the best split for each field. &amp;nbsp;Task 3 will narrow this down by choosing the single best split for each leaf node.&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The MapReduce steps in this task are straight forward. The map method takes the output task 2 and simply pulls the leaf id out of the value and makes it the key.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="background-color: #fdfdfd; border: 1px solid rgb(204, 204, 204); padding: 10px;"&gt;&lt;div style="text-align: left;"&gt;&lt;div style="margin: 0px;"&gt;&lt;div style="margin: 0px;"&gt;Map Input -&amp;gt; Map Output&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-family: 'Courier New',Courier,monospace;"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;div style="margin: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;(X, [0, 0,&amp;nbsp;&amp;lt;= 0.326, 0.811, [false:3, true:0], [false:0, true:1]])&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt;&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;(0, [0,&amp;nbsp;&amp;lt;= 0.326, 0.811, [false:3, true:0], [false:0, true:1])&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;div style="margin: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;(X, [0, 1, &amp;lt;= 0.0877, 0.311, [false:1, true:1], [false:2, true:0]])&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt;&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;(0,&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;[1, &amp;lt;= 0.0877, 0.311, [false:1, true:1], [false:2, true:0]])&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;The reduce step will iterate over all the candidate splits for a leaf node and simply output the one with the maximum information gain.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="background-color: #fdfdfd; border: 1px solid rgb(204, 204, 204); padding: 10px;"&gt;&lt;div style="text-align: left;"&gt;&lt;div style="margin: 0px;"&gt;&lt;div style="margin: 0px;"&gt;Reduce Input -&amp;gt; Reduce Output&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;div style="margin: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;(0, [[0,&amp;nbsp;&amp;lt;= 0.326, 0.811, [false:3, true:0], [false:0, true:1],&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;[0, 1, &amp;lt;= 0.0877, 0.311, [false:1, true:1], [false:2, true:0]]&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;)&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;-&amp;gt; (0,&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;[0,&amp;nbsp;&amp;lt;= 0.326, 0.811, [false:3, true:0], [false:0, true:1])&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;The reducer output gives us the information we need to grow the next generation of the tree. We iterate steps 2 and 3 until the tree is fully grown. And there we have a basic algorithm for growing a classification tree on "big data". I'll cover some caveats and possible improvements to this approach in the next section.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Continue to &lt;a href="http://ashenfad.blogspot.com/2011/03/decision-trees-hadoop-part-3-results.html"&gt;Part 3: Results&lt;/a&gt;.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7717060942576029301-9201348015429514782?l=ashenfad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ashenfad.blogspot.com/feeds/9201348015429514782/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://ashenfad.blogspot.com/2010/12/decision-trees-hadoop-part-2-approach.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/9201348015429514782'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/9201348015429514782'/><link rel='alternate' type='text/html' href='http://ashenfad.blogspot.com/2010/12/decision-trees-hadoop-part-2-approach.html' title='Decision Trees &amp; Hadoop - Part 2: Approach'/><author><name>Adam</name><uri>http://www.blogger.com/profile/17499355093104727120</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-z7IbKEsJpFk/Th6kWqEWzSI/AAAAAAAAAc4/ePllTfpQKSM/s220/aa.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7717060942576029301.post-7855654820041601094</id><published>2010-12-21T18:20:00.000-08:00</published><updated>2011-05-17T12:29:24.857-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='map reduce'/><category scheme='http://www.blogger.com/atom/ns#' term='supervised learning'/><category scheme='http://www.blogger.com/atom/ns#' term='classification tree'/><category scheme='http://www.blogger.com/atom/ns#' term='c4.5'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Decision Trees &amp; Hadoop - Part 1: Data</title><content type='html'>&lt;div style="text-align: justify;"&gt;Part 1: Data |&amp;nbsp;&lt;a href="http://ashenfad.blogspot.com/2010/12/decision-trees-hadoop-part-2-approach.html"&gt;Part 2: Approach&lt;/a&gt;&amp;nbsp;| &lt;a href="http://ashenfad.blogspot.com/2011/03/decision-trees-hadoop-part-3-results.html"&gt;Part 3: Results&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;-----&lt;br /&gt;Update: &amp;nbsp;Looks like some folks at Google have built a similar system (albeit much more polished). &amp;nbsp;You can check it out &lt;a href="http://research.google.com/pubs/pub36296.html"&gt;here&lt;/a&gt;.&lt;br /&gt;-----&lt;br /&gt;&lt;br /&gt;Recently I've been exploring&amp;nbsp;&lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; and some of its satellite projects. &lt;a href="http://en.wikipedia.org/wiki/MapReduce"&gt;MapReduce&lt;/a&gt; is a powerful paradigm for tackling huge datasets and I wanted a project to help me ease into it all.&amp;nbsp;&lt;a href="http://mahout.apache.org/"&gt;Mahout&lt;/a&gt;, a collection of machine learning algorithms for Hadoop, didn't yet have a &lt;a href="http://en.wikipedia.org/wiki/C4.5_algorithm"&gt;C4.5&lt;/a&gt;-like implementation for&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Decision_tree_learning"&gt;decision trees&lt;/a&gt;. C4.5 trees are useful yet conceptually simple, so that seemed like a good place to start.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The following few posts will cover my approach toward building a C4.5-like classification tree for Hadoop and follow up with my plans for a regression tree. &amp;nbsp;This is an educational exercise for me, so if you notice I'm doing something stupid or just have some MapReduce tips or tricks, please let me know!&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;b&gt;Why do we care?&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;First things first, why do we care? Why bother with all this Hadoop stuff? How about just sampling a large dataset until the result is small enough for a single machine?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;For the most part, sampling really is good enough. But if a problem space is extremely complex, then a smaller sample may not give an adequately clear view.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;As an example, I've chosen an area of the Mandelbrot set (in &lt;a href="http://mathworld.wolfram.com/ElephantValley.html"&gt;Elephant Valley&lt;/a&gt;) as my problem space. I choose random points and label them true or false depending on whether they're in the Mandelbrot set or not.&lt;br /&gt;&lt;blockquote&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;Mandelbrot CSV Sample: &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;0.322, 0.093, false&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;0.301, 0.085, false&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;0.377, 0.086, true&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;0.321, 0.089, false&lt;/span&gt;&lt;/blockquote&gt;I generated four datasets from this space with sizes of 10 million, 1 million, 100 thousand, and 10 thousand points. I plotted the datasets below by coloring a point red if it's within the Mandelbrot set, green if it is not. As you can see in the plots, 10 million points gives a fairly high fidelity picture of the space. Not surprisingly, 1 million, 100K, and 10K samples give progressively worse detail. &amp;nbsp;A classifier trained on one of the smaller samples will do well for the majority of space but the lack of resolution near the boundary will make that area troublesome.&lt;br /&gt;&lt;br /&gt;Using either&amp;nbsp;&lt;a href="http://www.cs.waikato.ac.nz/ml/weka/"&gt;Weka&lt;/a&gt;'s RepTrees (similar to C4.5) or RandomForests, I can learn decision trees for the 10K, 100K, and 1M datasets. The 10M dataset, however, is too large to train the classifiers inside my humble 2GB JVM.&lt;br /&gt;&lt;br /&gt;I also generated a second set of data that mirrors the first, but adds 50 dummy features. Each dummy feature is simply a random number between 0 and 1. For this dataset the Weka tree classifiers can only cope with the 10K and 100K samples.&lt;br /&gt;&lt;br /&gt;A Hadoop classifier can tackle the larger datasets and, hopefully, produce better results for points in the tricky areas of our problem space. Sampling and using a meta-classifier (like bagging) might do just as well. But, like I said earlier, this is an educational exercise. :)&lt;/div&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding: 6px; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_QfKbkQ_T7ag/TQ-ZZo_4tsI/AAAAAAAAAPo/gfPsGJKlx4s/s1600/plot10M.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/_QfKbkQ_T7ag/TQ-ZZo_4tsI/AAAAAAAAAPo/gfPsGJKlx4s/s1600/plot10M.png" style="cursor: move;" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;"&gt;10M Sample&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding: 6px; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/_QfKbkQ_T7ag/TQ-ZYDTc0eI/AAAAAAAAAPg/CyOZoovX6VE/s1600/plot1M.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/_QfKbkQ_T7ag/TQ-ZYDTc0eI/AAAAAAAAAPg/CyOZoovX6VE/s1600/plot1M.png" style="cursor: move;" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;"&gt;1M Sample&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding: 6px; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_QfKbkQ_T7ag/TQ-ZaNCw7jI/AAAAAAAAAPs/3LGtOBpnGiw/s1600/plot100K.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/_QfKbkQ_T7ag/TQ-ZaNCw7jI/AAAAAAAAAPs/3LGtOBpnGiw/s1600/plot100K.png" style="cursor: move;" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;"&gt;100K Sample&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/_QfKbkQ_T7ag/TQ-ZZJdAmEI/AAAAAAAAAPk/-_oqOT2JNdQ/s1600/plot10K.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/_QfKbkQ_T7ag/TQ-ZZJdAmEI/AAAAAAAAAPk/-_oqOT2JNdQ/s1600/plot10K.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;10K Sample&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;So that's the brief introduction to the data I'll be using in the following posts. Head to the next section to read about the general design for my Hadoop classification tree.&lt;br /&gt;&lt;br /&gt;Continue to &lt;a href="http://ashenfad.blogspot.com/2010/12/decision-trees-hadoop-part-2-approach.html"&gt;Part 2: Approach&lt;/a&gt;.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7717060942576029301-7855654820041601094?l=ashenfad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ashenfad.blogspot.com/feeds/7855654820041601094/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://ashenfad.blogspot.com/2010/12/decision-trees-in-hadoop-part-1-data.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/7855654820041601094'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/7855654820041601094'/><link rel='alternate' type='text/html' href='http://ashenfad.blogspot.com/2010/12/decision-trees-in-hadoop-part-1-data.html' title='Decision Trees &amp; Hadoop - Part 1: Data'/><author><name>Adam</name><uri>http://www.blogger.com/profile/17499355093104727120</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-z7IbKEsJpFk/Th6kWqEWzSI/AAAAAAAAAc4/ePllTfpQKSM/s220/aa.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_QfKbkQ_T7ag/TQ-ZZo_4tsI/AAAAAAAAAPo/gfPsGJKlx4s/s72-c/plot10M.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7717060942576029301.post-4433934312693678568</id><published>2010-12-20T09:51:00.000-08:00</published><updated>2011-01-11T15:36:02.641-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='partisanship'/><category scheme='http://www.blogger.com/atom/ns#' term='topolitics'/><category scheme='http://www.blogger.com/atom/ns#' term='politics'/><title type='text'>Partisanship</title><content type='html'>An analysis I put together on congressional roll-call votes:&lt;br /&gt;&lt;div&gt;&lt;a href="http://blog.topolitics.com/2010/11/crisis-of-partisanship.html"&gt;http://blog.topolitics.com/2010/11/crisis-of-partisanship.html&lt;/a&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7717060942576029301-4433934312693678568?l=ashenfad.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ashenfad.blogspot.com/feeds/4433934312693678568/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://ashenfad.blogspot.com/2010/12/partisanship.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/4433934312693678568'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7717060942576029301/posts/default/4433934312693678568'/><link rel='alternate' type='text/html' href='http://ashenfad.blogspot.com/2010/12/partisanship.html' title='Partisanship'/><author><name>Adam</name><uri>http://www.blogger.com/profile/17499355093104727120</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-z7IbKEsJpFk/Th6kWqEWzSI/AAAAAAAAAc4/ePllTfpQKSM/s220/aa.jpg'/></author><thr:total>0</thr:total></entry></feed>
