I’ve had a few people ask where I got the data for my last couple of posts, so I thought I’d put together a quick how-to for performing the same analysis, or running your own queries. I’ve used MySQL for this, but you should be able to easily convert these statements to work with your data store of choice. (Feel free to post them in the comments!)
First of all, you’re going to need to get the raw data. The AEC provides this as a bunch of CSV files, available for download on their site. The 2010 data can be found here, the 2007 data can be found here.1 Grab the National list of Candidates, and the State Below the Line Preferences files.
Next up, create some tables to hold this data. Here’s the table definition for the candidates table:
CREATE TABLE `2010_candidates` ( `state` varchar(3) NOT NULL, `party` varchar(5) NOT NULL, `party_name` text NOT NULL, `candidate_id` bigint(20) NOT NULL, `surname` text NOT NULL, `name` text NOT NULL, `elected` varchar(1) NOT NULL, `historic_elected` varchar(1) NOT NULL, KEY `party` (`party`), KEY `candidate_state` (`candidate_id`,`state`) ) ENGINE=InnoDB;
You’ll also need to create a table for each state. You could put it in one big table, but I was doing all of this on my local machine with a limited amount of RAM, so I wanted to keep the tables small:
CREATE TABLE `2010_prefs_nsw` ( `candidate_id` bigint(20) NOT NULL, `preference` int(11) NOT NULL, `batch` int(11) NOT NULL, `paper` int(11) NOT NULL, KEY `candidate_pref` (`candidate_id`,`preference`) ) ENGINE=InnoDB;
Importing the data is pretty easy. Just run this query for the Candidate file:
LOAD DATA LOCAL INFILE "/path/to/SenateCandidatesDownload-15508.csv" INTO TABLE 2010_candidates FIELDS TERMINATED BY ',' IGNORE 2 LINES;
Similarly, run this query for each state (modifying for the correct file name and table name, of course).
LOAD DATA LOCAL INFILE "/path/to/SenateStateBTLPreferences-15508-NSW.csv" INTO TABLE 2010_prefs_nsw FIELDS TERMINATED BY ',' IGNORE 2 LINES;
You now have all of the raw data. I preferred to make a summary table to hold the preference count for each candidate:
CREATE TABLE `2010_prefs_summary` ( `candidate_id` bigint(20) NOT NULL, `preference` int(11) NOT NULL, `preference_total` bigint(20) NOT NULL, KEY `candidate_pref` (`candidate_id`,`preference`) ) ENGINE=InnoDB;
Populating this table is pretty easy. Just repeat this query for each state (note the reference to the 2010_prefs_nsw table, and the “NSW” WHERE clause):
INSERT INTO 2010_prefs_summary ( SELECT c.candidate_id, p.preference, count( p.preference ) AS preference_total FROM 2010_candidates c LEFT JOIN 2010_prefs_nsw p ON c.candidate_id=p.candidate_id WHERE p.state="NSW" );
Now, grabbing the preference data for any candidate is quite easy, you just need to know their candidate_id (look it up in the 2010_candidates table):
SELECT * FROM 2010_prefs_summary WHERE candidate_id=12345;
Have fun! If you come up with anything interesting, let me know on Twitter, I’d love to hear about it!
- I’ve contacted the AEC, but it seems they don’t have data available from before the 2007 federal election – I assume they just didn’t store the raw counting data. ↩