Additional Bulk Data Usage

The file location.tsv is not all that useful by itself. The first colulmn is labeled id but should not be confused with the location_id used by the api. The api's location_id is the concatentation of the latitute, an or bar symbol | and the longitude. Ex 38.984652|-77.0947092 for Bethesda, MD vs the id of ry9e7lv1rdae from location.tsv. Further, the latitudes and longitudes are truncated in location.tsv. They look like the float versions of doubles, inconveniently rounded to fewer decimal places. For Bethesda, MD location.tsv has 38.9847 -77.0947 so we cannot derive the location_id. The data dictionary doesn't spell it out but location.tsv has ids that are not the api's location_ids and imprecise latitudes and longitudes keeping us from generating the api's location_ids.

rawlocation.tsv is also not that useful by itself but I noticed it could be used in conjection with location.tsv. rawlocation.tsv has both the api's location_id and the id used in location.tsv. Here's where it gets a bit confusing though. In rawlocation.tsv the api's location_id is labeled latlong and the id used in location.tsv is labeled location_id. It's not labeled this way but emperically the second column of rawlocation.tsv should be labeled id and the last column labeled location_id. Then the one to many join might be easier to see. Take the ids in location.tsv and pick one match by id from rawlocation.tsv to get the api's location_id.

So I created a new and improved online location table that is comprised of the api's location_id from rawlocation.tsv (staring as the primary key) with the city, state, country, latitude and longitude from location.tsv (by matching the id in location.tsv to the column labeled location_id in rawlocations). I could have parsed the latlong column into doubles for the latitute and longitude but I resisted that temptation. Interestingly, there were 602 ids in location.tsv that were not found in rawlocation.tsv. Both files were downloaded 1/12/2018. I did not load these rows into my new and improved location table.

My use case is to figure out what 5 patent cities are the closest to a given latitude and longitude (think Pokémon Go!). I could use an online version of location.tsv to figure this out but the problem is then retrieving data for the 5 cities from the api's locations, inventors and assignees endpoints. The request I would need to make would have to be q={"_and":["location_city":"a"},{"location_state":"b"},{"location_contry":"c"}]} since location.tsv doesn't have the api's location_id. The problem is that the endpoint won't return results if there are UTF-8 characters in the city or state, there is a double quote in the city, state or country (and yes, there are examples of each in location.tsv) or if any of the fields are null (again, yes, there are examples of each in location.tsv). Additionally, single quotes would need to be doubled up. Ex the city Hashmona'im needs to be requested as {"location_city":"Hashmona''im"}. Fortunately, all of these problems go away if I use my new and improved table! My query can be for a specific location_id and I can even request multiple locations in a single request ex: q={"location_id":["55.3485154|-131.64671","55.3485154|-131.64671","34.9928552|135.8850515"]}.

With my new and improved table I found duplicate locations with UTF-8 characters and duplicate locations without UTF-8 characters.