Analyzing presidential elections

Given the timeliness of the 2012 US Presidential Election and the inherent intricacies of the electoral process, we internally used an interesting dataset on poll results to test RethinkDB's query language support in the Javascript, Python, and Ruby client libraries.

We'll use this dataset to walk through RethinkDB's Javascript query language using either the Data Explorer in the web admin UI or the Node.js REPL.

If you'd like to follow along: Set up an instance of RethinkDB using these simple instructions. Then download the datasets we'll be using and import the CSV files:

$ rethinkdb import --join localhost:29015 --table test.input_polls --primary-key uuid --input-file pres_polls.csv
$ rethinkdb import --join localhost:29015 --table test.county_stats --input-file CO-EST2011-Alldata.csv

What does this data set contain?

You can download the dataset files from RethinkDB's GitHub repository. There are two data sets:

  • the first set contains multiple poll results at state-level pres_polls.csv
  • the second set contains various population stats at county-level CO-EST2011-Alldata.csv

Here is an example document from each of these sets:

r.table('input_polls').limit(1)

{
     "Day": 92,
     "Len": 1,
     "id": "Montana",
     "EV": 3,
     "Dem": 40,
     "GOP": 49,
     "Ind": "",
     "Date": "Apr 02",
     "unnamed": "",
     "unnamed2": "",
     "unnamed3": "",
     "unnamed4": "",
     "unnamed5": "",
     "unnamed6": "",
     "unnamed7": "",
     "Pollster": "Rasmussen-1",
     "uuid": "00c36ccf-b499-ce1d-93e6-16b3cd186cde"
}
r.table('county_stats').limit(1)

{
     "SUMLEV": 50,
     "REGION": 3,
     "DIVISION": 6,
     "state": 21,
     "county": 129,
     "Stname": "Kentucky",
     "ctyname": "Lee County",
     "CENSUS2010POP": 7887,
     "ESTIMATESBASE2010": 7887,
     "POPESTIMATE2010": 7876,
     "POPESTIMATE2011": 7817,
     "NPOPCHG_2010": -11,
     "NPOPCHG_2011": -59,
     "Births2010": 16,
     "Births2011": 70,
     "Deaths2010": 25,
     "Deaths2011": 124,
     "NATURALINC2010": -9,
     "NATURALINC2011": -54,
     "INTERNATIONALMIG2010": 0,
     "INTERNATIONALMIG2011": 0,
     "DOMESTICMIG2010": -1,
     "DOMESTICMIG2011": -6,
     "NETMIG2010": -1,
     "NETMIG2011": -6,
     "RESIDUAL2010": -1,
     "RESIDUAL2011": 1,
     "GQESTIMATESBASE2010": 1016,
     "GQESTIMATES2010": 1016,
     "GQESTIMATES2011": 1016,
     "Rbirth2011": 8.921175,
     "Rdeath2011": 15.803224,
     "RNATURALINC2011": -6.882049,
     "RINTERNATIONALMIG2011": 0,
     "RDOMESTICMIG2011": -0.764672,
     "RNETMIG2011": -0.764672,
     "id": "00634d54-1283-4f47-a052-a1e134671495"
}

Data cleanup: chaining, GROUPBY, map

We'll first clean up the data from press_polls.csv that we imported into the input_polls table, as we want to calculate the average results of various polls at the state level. We'll also get rid of unnecessary/empty attributes. Finally we'll store the result in a new table:

// insert cleaned up data into a new table
r.table('polls').insert(
  r.table('input_polls').groupBy('id', 
    {
      // map phase: emit the poll results for each party and the state name
      'mapping': function(row) { 
        return rethinkdb.expr({'Dem': row('Dem'), 'GOP': row('GOP'), 'polls': 1, 'id': row('id')}) 
      },

      // reduction base values used to compute sums and averages
      'base': {'Dem': 0, 'GOP': 0, 'polls': 0}, 

      // reduction phase: sum up the poll results for each state
      'reduction': function(acc, doc) { 
        return r.expr({
            'Dem': acc('Dem').add(doc('Dem')),
            'GOP': acc('GOP').add(doc('GOP')),
            'polls': acc('polls').add(doc('polls')),
            'id': doc('id') })
      },

      // run at the end of GROUPBY operation to compute the averages 
      'finalizer': function(res) { 
        return { 
            'Dem': res('Dem').div(res('polls')), 
            'GOP': res('GOP').div(res('polls')), 
            'polls' : res('polls'),
            'id': res('id') }
      }
    })
    // return only the `reduction` attribute of the resulting docs
    .map(r.row('reduction')) 
)

r.table('polls').limit(1)

{
    "Dem": 42.294117647058826,
    "GOP": 48.294117647058826,
    "polls": 17,
    "id": "Arizona"
}

Data analysis: projections, JOINs, orderby, group-map-reduce

Based on this data let's try to see if we can figure out how many voters a party would need to turn to win the states. For the sake of this post, we'll go with the Democrats.

Let's start with what estimates polls project at the county level by JOINing the polls and county_stats tables:

r.table('county_stats').eqJoin('Stname', r.table('polls')) // equi join of the two tables
    .zip() // flatten the results
    .pluck('Stname', 'state', 'county', 'ctyname', 'CENSUS2010POP', 'POPESTIMATE2011', 'Dem', 'GOP') // projection

Building on this query, next we can find the counties where the Democrats are in minority:

r.table('county_stats').eqJoin('Stname', r.table('polls'))
    .zip() 
    .pluck('Stname', 'state', 'county', 'ctyname', 'CENSUS2010POP', 'POPESTIMATE2011', 'Dem', 'GOP')
    .filter(function(doc) { return doc('Dem').lt(doc('GOP')) })

Or even better where Democrats are within 15% of the Republicans:

r.table('county_stats').eqJoin('Stname', r.table('polls'))
    .zip() 
    .pluck('Stname', 'state', 'county', 'ctyname', 'CENSUS2010POP', 'POPESTIMATE2011', 'Dem', 'GOP')
    .filter(function(doc) { return doc('Dem').lt(doc('GOP')).and(doc('GOP').sub(doc('Dem')).lt(15)) })

The last step in answering the initial question of how many voters should the Democrats win to turn the results is just a groupedMapReduce away:

r.table('county_stats').eqJoin('Stname', r.table('polls'))
    .zip() 
    .pluck('Stname', 'state', 'county', 'ctyname', 'CENSUS2010POP', 'POPESTIMATE2011', 'Dem', 'GOP')
    .filter(function(doc) { return doc('Dem').lt(doc('GOP')).and(doc('GOP').sub(doc('Dem')).lt(15)) })
    .groupedMapReduce(
        function(doc){ return doc('Stname') }, 
        function(doc){ return doc('POPESTIMATE2011').mul(doc("GOP").sub(doc("DEM")).div(100)) }, 
        0, 
        function(acc, val) {return acc.add(val)})
    .orderBy('reduction')

And the outcome of our quick presidential election data analysis that addresses the question how many voters the Democrat party would need to turn to win the states (this assumes 100% turnout of the entire population of a state):

Data analysis with RethinkDB

If you followed along, the queries above should have given you a taste of ReQL: chaining, projections, order by, group by, JOINs, grouped-map-reduce. Of course this tutorial isn't statistically significant. If you interested in statistically significant results, checkout the election statistics superhero Nate Silver.