We need more votes
I have a problem. As I’ve mentioned before, I derive an odd pleasure from organizing data and making pipelines that organize data. It’s a sickness, to the point where sometimes someone can point me at a dataset that is either noisy or needlessly inaccessible, and writing a pile of Python to organize it becomes the thing I do to relax, scratching that part of my brain that wants to see efficiency everywhere.
This habit recently intersected with my involvement in local politics, when someone showed me that this is how the Pennsylvania legislative branch officially releases its votes. All the data is there, but it’s split up across multiple pages.
- There are two legislative chambers.
- Each chamber’s activity is divided up into sessions (mostly years, but with some exceptions).
- Each session has some number of days.
- Each day has some number of “roll calls”.
- Each roll call has a list of the people and how they voted.
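To make that nesting concrete, here is a minimal sketch of how the hierarchy could be modeled in Python and flattened into one row per individual vote. The class names, field names, and the "Y"/"N" vote codes are all my own placeholders for illustration; they are not taken from the state's site or from any particular scraper.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class RollCall:
    number: int
    # Name exactly as printed on the page -> how that person voted.
    # The "Y"/"N" codes are an assumption for this sketch.
    votes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Day:
    date: str  # e.g. "2023-01-03"
    roll_calls: List[RollCall] = field(default_factory=list)


@dataclass
class Session:
    label: str  # usually a year, but not always
    days: List[Day] = field(default_factory=list)


@dataclass
class Chamber:
    name: str
    sessions: List[Session] = field(default_factory=list)


def flatten(chambers: List[Chamber]) -> Iterator[Tuple[str, str, str, int, str, str]]:
    """Walk the nested structure and yield one flat row per individual vote."""
    for chamber in chambers:
        for session in chamber.sessions:
            for day in session.days:
                for roll_call in day.roll_calls:
                    for name, vote in roll_call.votes.items():
                        yield (chamber.name, session.label, day.date,
                               roll_call.number, name, vote)
```

Once everything is in flat rows like this, it is easy to dump to CSV or load into a database, which is most of what "organizing" the data means here.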
My naive little brain then just thought that I could knock out a simple web scraper to grab all that information. So that’s basically what I did, resulting in PALegislature on GitHub. However, it definitely followed the Ninety-ninety rule in that the last 10% of “weird” corner cases ended up taking a lot of time.
Surprisingly, one of the trickiest things was matching the names. See, most of the time the vote page just lists the last name of the legislator, but if there is more than one legislator with that last name, each is given a first initial. To make it worse, if a legislator joins partway through the session, some votes in that session will list just the last name, and some will list the last name and the initial. That all led to me writing a fair bit of code for fuzzy name matching, and then hand-annotating a bunch of the corner cases, which I hate, but it was necessary. Usually I’m against more voter ID, but in this case, I’d settle for a unique identifier for each legislator.
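As a rough illustration of the core of the problem, here is a simplified sketch of resolving a printed name against a roster. It is deliberately not fuzzy: it does an exact last-name match with an optional first-initial tiebreak, and it gives up (returns None) when the answer is ambiguous so that case can be hand-annotated. The "Last" / "Last, F." format, the Legislator type, and the legislator_id field are assumptions made for this example, not the actual page format or the code in the repository.

```python
import re
from typing import List, NamedTuple, Optional


class Legislator(NamedTuple):
    legislator_id: str  # hypothetical unique ID, the thing I wish existed
    first: str
    last: str


# Assumed printed formats: "Smith" or "Smith, J." (the real pages may differ).
_PRINTED_NAME = re.compile(r"\s*([^,]+?)\s*(?:,\s*([A-Za-z])\.?)?\s*")


def resolve(printed: str, roster: List[Legislator]) -> Optional[Legislator]:
    """Return the single roster entry matching the printed name, else None."""
    match = _PRINTED_NAME.fullmatch(printed)
    if match is None:
        return None  # unrecognized format: leave it for hand annotation
    last, initial = match.group(1).casefold(), match.group(2)
    candidates = [p for p in roster if p.last.casefold() == last]
    if initial is not None:
        candidates = [p for p in candidates
                      if p.first[:1].casefold() == initial.casefold()]
    # Zero or several candidates means we cannot decide automatically.
    return candidates[0] if len(candidates) == 1 else None
```

One way to cope with the mid-session case would be to pass in the roster as it stood on the date of the vote, so that a bare "Smith" printed before the second Smith arrived still resolves to exactly one person. Anything that comes back as None goes on the pile for manual review.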
In conclusion, beware: [Original comic by XKCD]