Each week I try to find data published in the Watertown document center and hack on it, with the aim of presenting it in a form people might find more interesting than a PDF, Excel or PowerPoint file. Today I decided to focus on the building permit submissions. We’re about to do some water damage work in my daughter's room, so this felt like a relevant area to hack on.
The Patch blog doesn't allow embedding of dynamic maps, so you'll have to either head over to my blog to view the full details or click on the attached photo to get a sense of how this works.
Determine the effort involved in creating an automated process to place issued building permits from Watertown, MA on a Google map.
Data for this hack
Tools used in this hack
- DeskUNPDF (Currently using a trial version but it looks like it is working great)
- A little Ruby scripting and some Ruby Gems (csv, typhoeus, json)
- Google Geocoding API
- Socrata was super easy for visualizing/mapping the data
My Hack Results and a description
Overall I’m pretty happy with how this hack turned out. I was able to take a very boring presentation of this pretty interesting information that was locked up in a PDF and display it on a map. Here are the hoops that I had to jump through to get this working.
- Download the PDF from the Watertown website.
- Open up DeskUNPDF and use their conversion tool to identify tabular data in a PDF so it could be extracted as a comma separated value (CSV) file.
- Write a little Ruby script that cleans up the data further so the addresses can be sent to the Google Geocoding API to get latitude and longitude coordinates. After getting the lat/long coordinates back from Google, I then had to write everything out as a new CSV file for Socrata to take over.
- Upload a new data set to Socrata and combine the lat/long fields into a location field.
- Use Socrata to generate the Google Map view.
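To give a feel for the Ruby step above, here is a rough sketch of the clean-up, geocode and write-out flow. It is not the exact script from this hack: the `Address` column name, the helper names and the "Watertown, MA" suffix are assumptions for illustration, and it uses Ruby's standard `net/http` rather than typhoeus to stay dependency free. The endpoint and the `results[0].geometry.location` response shape are those of the Google Geocoding API.

```ruby
require 'csv'
require 'json'
require 'uri'
require 'net/http'

GEOCODE_ENDPOINT = 'https://maps.googleapis.com/maps/api/geocode/json'

# Collapse stray whitespace left over from the PDF conversion and append
# the town and state (assumption: every permit address is in Watertown, MA).
def clean_address(raw)
  street = raw.to_s.strip.gsub(/\s+/, ' ')
  "#{street}, Watertown, MA"
end

# Build the Geocoding API request URI for one address.
def geocode_url(address, api_key)
  uri = URI(GEOCODE_ENDPOINT)
  uri.query = URI.encode_www_form(address: address, key: api_key)
  uri
end

# Pull [lat, lng] out of a Geocoding API JSON body, or nil if nothing matched.
def extract_lat_lng(body)
  data = JSON.parse(body)
  return nil unless data['status'] == 'OK'

  location = data.dig('results', 0, 'geometry', 'location')
  location && [location['lat'], location['lng']]
end

# Perform the real HTTP lookup (defined here but not called in this sketch).
def geocode(address, api_key)
  extract_lat_lng(Net::HTTP.get(geocode_url(address, api_key)))
end

# Append Latitude/Longitude columns to the CSV text. The block receives the
# cleaned address and returns [lat, lng], or nil when the lookup fails.
def add_coordinates(csv_text, address_column: 'Address')
  rows = CSV.parse(csv_text, headers: true)
  CSV.generate do |out|
    out << rows.headers + %w[Latitude Longitude]
    rows.each do |row|
      lat, lng = yield(clean_address(row[address_column]))
      out << row.fields + [lat, lng]
    end
  end
end

# Offline demo with a made-up row and a stubbed lookup instead of a live call.
sample = "Permit,Address\nB-1,12 Main St\n"
puts add_coordinates(sample) { |_address| [42.37, -71.18] }
```

For a real run, the block would call `geocode(address, api_key)` with your own API key, and the resulting text would be saved to a file for upload to Socrata.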
How to improve this
- Automate the downloading of building permit PDFs from the Watertown website
- Detect when new PDFs are available for download
- Script the PDF to CSV conversion using the DeskUNPDF command line interface rather than the UI
- Automate the updating of the dataset on Socrata
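One possible way to handle the "detect new PDFs" item is to remember which PDF links have already been processed and compare against the links on the page each time the script runs. The sketch below assumes the permit PDFs show up as ordinary `href` links ending in `.pdf`, which I haven't verified against the Watertown site; the function names and the JSON state file are hypothetical.

```ruby
require 'json'
require 'set'

# Find every link on the page that points at a PDF.
def pdf_links(html)
  html.scan(/href=["']([^"']+\.pdf)["']/i).flatten.uniq
end

# Load the set of links handled on previous runs (empty on the first run).
def load_seen(path)
  File.exist?(path) ? Set.new(JSON.parse(File.read(path))) : Set.new
end

# Return only the PDF links that have not been processed yet.
def new_links(html, seen)
  pdf_links(html).reject { |link| seen.include?(link) }
end

# Persist the set of handled links for the next run.
def save_seen(path, seen)
  File.write(path, JSON.generate(seen.to_a.sort))
end
```

Each run would fetch the page, call `new_links`, download and convert whatever comes back, then add those links to the set and call `save_seen`.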
One sticking point is that DeskUNPDF isn’t picking up the first two rows in each PDF, so I need to ask its developers if they know what might be going on with that. Missing the first two rows isn’t a huge deal, but I’d like the dataset to be accurate.
While I think seeing the individual permits on a map is better than looking at a list in a PDF, I would like to see this data in graph form with permits plotted over time. Obviously summer months are high for permits but I would be quite interested to see how weeks and months compare with previous years.
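As a starting point for that kind of graph, the permits could be counted per month straight from the CSV. This is only a sketch: the `Issued` column name and the `MM/DD/YYYY` date format are assumptions about how the converted data looks, and the function name is made up for illustration.

```ruby
require 'csv'
require 'date'

# Count permits per calendar month, keyed as "YYYY-MM" and sorted by month.
# The column name and date format defaults are assumptions, not taken from
# the actual Watertown data.
def permits_per_month(csv_text, date_column: 'Issued', format: '%m/%d/%Y')
  counts = Hash.new(0)
  CSV.parse(csv_text, headers: true).each do |row|
    date = Date.strptime(row[date_column], format)
    counts[date.strftime('%Y-%m')] += 1
  end
  counts.sort.to_h
end
```

The resulting hash can be written back out as a two-column CSV and charted in Socrata or a spreadsheet, which would make it easy to compare the same month across years.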
Not a ton of code here, but you can find the Ruby script below. If others are interested in helping me write scrapers for this data, the scripts will be updated on GitHub.