How to Complicate Open Data

By Aaron O. Ellis

Tuesday, April 1, 2014

Cities are increasingly driven by data. When that data is released to the public the city is rightfully treating information as infrastructure. But as with all infrastructure, it must be constructed wisely and maintained properly.

Here are a few ways open data can go wrong.

Call a Field “UniqueId” and Make It Not Unique

Liquor licenses in the city of Denver are available for download in a variety of formats and are updated daily.

The schema for this data includes the field UNIQUE_ID, which surprisingly or not surprisingly depending on your level of cynicism, is not actually unique.

Unique identifiers, when actually unique, provide a way to quickly index and reference objects. Duplication makes this process impossible. Anyone who shares a first name with a friend knows this dilemma well.

Luckily in the case of the licenses, the duplicate ids do not differ and can be removed. But this is an unnecessary additional step in taking the data from a raw to usable state.

Don’t Tell Anyone Your Coordinate System

The spherical nature of the Earth complicates geographical calculations. The solution used in many geographic information systems is to transform spherical coordinates onto a flat plane. Within a large area, even one as large as a third of Colorado, errors due to this transformation will be less than 1:10,000.

The liquor license schema also includes the fields X_COORD and Y_COORD, but does not name its coordinate system or zone in the file or metadata. They happen to belong to the Colorado Central Grid, just one of the zones in the State Plane Coordinate System.

Transforming the given x and y values to more recognizable coordinates, such as latitude and longitude, is a quick calculation, but only if you know what coordinate system you’re using.

Randomly Sort the File

Checking if two files differ should not be a difficult operation. On Unix-based systems there are a number of tools installed by default to compare files. Continuing with the city of Denver liquor licenses, we can see the byte count of the downloaded files over successive days with ls -l:

392152 Mar 28 03:00 licenses_2014-03-28.csv
393545 Mar 29 03:00 licenses_2014-03-29.csv
393545 Mar 30 03:00 licenses_2014-03-30.csv

The files on the 29th and 30th have the same number of bytes, but are the actual bytes the same? We can quickly calculate a checksum of these files with sha1sum:

08260d863858ceacecdeda3f63727ba954e63cb7  licenses_2014-03-28.csv
bde511e788140672682ad985c2710022772f5034  licenses_2014-03-29.csv
db235e5ce4c4236417f265bd85645257d123d6b2  licenses_2014-03-30.csv

Despite the last two files having the same number of bytes, they actually differ. Unfortunately, a SHA1 checksum cannot tell us the magnitude of these differences, but we can use diff to perform a line by line comparison. For instance, if the name on a license changed, we’d see the following:

< 331902_LI32,1063820,331902,WHOLE FOODS MARKET,7400 E HAMPDEN AVE # D-1,LI32,1,LIQUOR-3.2 % BEER,3.2 % Beer - City,2014-03-06 00:00:00,2015-02-19 00:00:00,C,0,180030.0,3,4,06809,V,3168440.0,1663019.0
---
> 331902_LI32,1063820,331902,HOLE FOODIE MARKET,7400 E HAMPDEN AVE # D-1,LI32,1,LIQUOR-3.2 % BEER,3.2 % Beer - City,2014-03-06 00:00:00,2015-02-19 00:00:00,C,0,180030.0,3,4,06809,V,3168440.0,1663019.0

But running diff on the files from the 29th and 30th tells us that 1,525 lines have changed, nearly three quarters of their total 2,097 lines. What dramatic changes could have occurred in one day? Did a dutiful government worker spend his Saturday rewriting liquor licenses?

Spoiler alert: no. Nothing changed.

On a hunch, I sorted these files. Unix provides a tool called sort, however, its usage is made slightly more complicated by the need to skip the header in the first row:

(head -n 1 licenses_2014-03-30.csv && tail -n +2 licenses_2014-03-30.csv | sort) > sorted_2014-03-30.csv

After sorting both the files, we can rerun the checksum:

d978ff44979f0fa7c3d33b7f4db7df909856f6f8  sorted_2014-03-29.csv
d978ff44979f0fa7c3d33b7f4db7df909856f6f8  sorted_2014-03-30.csv

These files contain exactly the same information. They were randomly sorted before publishing.

Don’t Escape Values

The comma-separated values file format has been used for decades and is supported by nearly every spreadsheet and database. Other characters can be used to separate values, such as tabs, but no matter the implementation there must be a way to include the character used to separate within actual values. A widely implemented solution for this dilemma is to use double quotes to wrap a value that itself contains separating characters. For instance, instead of the ambiguous:

Eats, Shoots & Leaves,Truss, Lynne

The values would be wrapped in double quotes to show that this row only contains a book title and the author by last name:

"Eats, Shoots & Leaves","Truss, Lynne"

I’ve seen naive implementations of CSV writers forget to escape values in this manner, corrupting their output as a result. But there is another way this process can go wrong. Since the double quote now has significance, any double quotes in the value must be escaped. For instance:

"\"You miss 100% of the shots you don't take.\" -Wayne Gretzky." -Michael Scott

The city of Denver releases many files in CSV format, including their checkbook and assessor sales. But both these files contain escaping errors, such as the following:

A simple rule: if your file cannot be parsed by a CSV reader, it shouldn’t end in .csv.

Only Provide a PDF

Street closure data for the city of Denver is updated every two weeks. It is released in Portable Document Format, which until 2008 was a proprietary format controlled by Adobe Systems. The city recommends the following method for finding a closure important to you:

If looking for a particular street, hit Ctrl + F to find it within the document.

Releasing data as a PDF complicates parsing and introduces unnecessary roadblocks for developers looking to build something better than Ctrl + F.