Bulk API Data and Query Tools

API providers often make bulk data available. The folks at patentsview.org have this page that lists what they've made available. They've even provided a data dictionary (spreadsheet) which explains exactly what's included. Their work is covered by the Creative Commons Attribution 4.0 License. Apparently all I have to do is credit the source, provide a link to the license and state if I've made changes. Oh, and I'm not to suggest that the licensor endorses me or my use of their data.

My Android app and other pages on this site use an online version of location.tsv with two changes. First, I removed the last three columns as I had no need for them. Second, I removed the 793 rows that contained non-ASCII (multi-byte UTF-8) characters, because requests to the location endpoint failed when those cities and states were used. See utf8_locations.htm. I've also begun working on a new and improved locations table as explained here.
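
The cleanup described above can be sketched in a few lines of Python. This is only an illustration, not the script I actually ran: the function name and the sample column names are made up, and I'm assuming the non-ASCII check only needs to look at the columns that are kept.

```python
import csv
import io


def clean_locations(tsv_text, drop_last=3):
    """Drop the trailing columns and skip rows containing non-ASCII characters."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        trimmed = row[:-drop_last] if drop_last else row
        # str.isascii() (Python 3.7+) is True only if every character is ASCII
        if all(field.isascii() for field in trimmed):
            kept.append(trimmed)
    return kept


# Made-up sample; the real column names in location.tsv may differ.
sample = (
    "id\tcity\tstate\tcol_a\tcol_b\tcol_c\n"
    "1\tBoston\tMA\ta\tb\tc\n"
    "2\tZürich\tZH\ta\tb\tc\n"
)
for row in clean_locations(sample):
    print("\t".join(row))
```

For the real file you would read the text with open("location.tsv", encoding="utf-8") and pass its contents in.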

On my Government Organization page I use an online version of government_organization.tsv with several changes. I moved the name column to the end (after level_three) and display it as an empty string if it matches the level_one, level_two or level_three entry for that row. I also removed the single stray a0 byte in id 25.
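
The display rule for the name column is simple enough to show as a sketch. The function name and the sample values are hypothetical; only the column names come from the file.

```python
def display_row(level_one, level_two, level_three, name):
    """Return the row in display order: levels first, name last.

    The name is blanked when it just repeats one of the level columns.
    """
    levels = (level_one, level_two, level_three)
    shown_name = "" if name in levels else name
    return [*levels, shown_name]


# Placeholder values for illustration only.
print(display_row("Agency A", "", "", "Agency A"))
print(display_row("Agency A", "Office B", "", "Lab C"))
```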

On my withdrawn patent pages I used patentsview's patents.tsv to figure out which patents are also in the uspto's withdrawn patents file.

On this page I am accessing an online version of botanic.tsv with two changes. The first was to drop the id column (first column in the spreadsheet). It's a string of 36 characters that I didn't see a need for. The second change was to delete the three rows of reissued data.

uuid                       patent_id  latin_name    variety
9qskd3l0758uuilhx3w55c7cf  RE46030    Rubus idaeus  Advabertwee
cmrlj47ihepfnqzpriuh7hd43  RE46041    Rubus idaeus  Advabereen
jscvs1slb2sjau3u8zl2ayj08  RE46031    Rubus idaeus  Advaberimar
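
Both changes to botanic.tsv could be made with something like the following. Treat it as a sketch: the function name is made up, I'm assuming the reissues can be recognized by an "RE" prefix on patent_id, and the id on the plant patent row in the sample is a placeholder.

```python
import csv
import io


def clean_botanic(tsv_text):
    """Drop the first (id) column and skip reissue rows (patent_id starts with RE)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    patent_idx = header.index("patent_id")
    cleaned = [header[1:]]
    for row in reader:
        if row[patent_idx].startswith("RE"):
            continue  # reissued patent, not wanted here
        cleaned.append(row[1:])
    return cleaned


sample = (
    "uuid\tpatent_id\tlatin_name\tvariety\n"
    "9qskd3l0758uuilhx3w55c7cf\tRE46030\tRubus idaeus\tAdvabertwee\n"
    "placeholder-id\tPP20696\tStenotaphrum secundatum\tPolaris\n"
)
for row in clean_botanic(sample):
    print("\t".join(row))
```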


Which leaves us with 18,333 rows of data from PP33,801 issued 2021-12-28 through . Once the data is loaded in a database table we can do fun things like figure out which latin names are used most often.

patentsview.org botanic data
count  latin name
940    Rosa hybrida
464    Chrysanthemum×morifolium
295    Calibrachoa sp.
275    Prunus persica
269    Rosa hybrid
258    Hydrangea macrophylla
256    Phalaenopsis hybrid
247    Impatiens hawkeri
246    Petunia×hybrida
203    Pelargonium×hortorum
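
A count like the one above is a plain GROUP BY once the rows are in a table. Here is a minimal sketch using Python's built-in sqlite3 module. The choice of SQLite, the table name, the function name and the sample patent ids are all assumptions made for the example.

```python
import sqlite3


def top_latin_names(rows, limit=10):
    """rows: iterable of (patent_id, latin_name, variety) tuples."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE botanic (patent_id TEXT, latin_name TEXT, variety TEXT)"
    )
    con.executemany("INSERT INTO botanic VALUES (?, ?, ?)", rows)
    result = con.execute(
        "SELECT COUNT(*) AS n, latin_name FROM botanic "
        "GROUP BY latin_name ORDER BY n DESC, latin_name LIMIT ?",
        (limit,),
    ).fetchall()
    con.close()
    return result


# Made-up rows; only the latin names are taken from the real data.
sample = [
    ("PP00001", "Rosa hybrida", "Variety A"),
    ("PP00002", "Rosa hybrida", "Variety B"),
    ("PP00003", "Prunus persica", "Variety C"),
]
for count, latin_name in top_latin_names(sample):
    print(count, latin_name)
```

Note that the query groups on the exact string, which is why "Rosa hybrida" and "Rosa hybrid" show up as separate entries.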


Or check for gaps in the data:

Data Gaps
gap_starts_at  gap_ends_at
PP-46040       PP-46032
PP-46029       PP15459
PP16339        PP16339
PP22005        PP22005
PP25724        PP25724
PP25791        PP25791
PP25810        PP25810
PP25911        PP25911
PP27441        PP27441
PP29995        PP29995
PP31408        PP31408
PP31892        PP31893
PP32102        PP32102
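
Finding gaps boils down to sorting the patent numbers and looking for neighbors that are more than one apart. The table above came out of my database, so the sketch below is a Python stand-in for the same idea rather than the query I ran. The function names are made up and the sample list is just a handful of numbers on either side of two of the gaps.

```python
def find_gaps(numbers):
    """Return (gap_start, gap_end) pairs for integers missing between neighbors."""
    ordered = sorted(set(numbers))
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


def plant_number(patent_id):
    """Turn an id like 'PP16339' into the integer 16339."""
    return int(patent_id[2:])


# Illustrative subset of ids around two of the gaps.
present = ["PP16338", "PP16340", "PP16341", "PP31891", "PP31894"]
for start, end in find_gaps(plant_number(p) for p in present):
    print(f"PP{start}\tPP{end}")
```

With only a few ids loaded, everything between PP16341 and PP31891 is reported as one big gap too, which is exactly what you'd want to know about if it happened in the real data.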


I checked and the missing numbers correspond to withdrawn plant patents. Interestingly, there was another withdrawn plant patent in this time frame yet there is a row in the spreadsheet for it. PP20696 is in the patentsview database so the uspto and patentsview are not totally in sync. This page has more information on the withdrawn patents patentsview returns.

Odd Row
patent_no  title      issue_date  latin_name               variety
PP20696    WITHDRAWN  2010-02-02  Stenotaphrum secundatum  Polaris


Separately, from the uspto, I have the plant patent issue dates and other fields. I could do some sort of mash-up to, say, show the most popular latin names by issue year.

year  count  latin name
2008  109    Rosa hybrida
2006  95     Rosa hybrida
2005  76     Rosa hybrida
2007  70     Chrysanthemum×morifolium
2016  64     Rosa hybrida
2013  63     Rosa hybrida
2021  62     Rosa hybrida
2006  60     Chrysanthemum morifolium
2012  60     Chrysanthemum×morifolium
2010  56     Rosa hybrida
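
The mash-up is a join between the botanic rows and the issue dates, grouped by year and latin name. Again a sketch with sqlite3: the table names, the function name and the sample rows are invented for the example, and I'm assuming the issue dates are stored as YYYY-MM-DD text.

```python
import sqlite3


def top_names_by_year(botanic_rows, issue_rows, limit=10):
    """botanic_rows: (patent_id, latin_name); issue_rows: (patent_id, issue_date)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE botanic (patent_id TEXT, latin_name TEXT)")
    con.execute("CREATE TABLE issued (patent_id TEXT, issue_date TEXT)")
    con.executemany("INSERT INTO botanic VALUES (?, ?)", botanic_rows)
    con.executemany("INSERT INTO issued VALUES (?, ?)", issue_rows)
    result = con.execute(
        """
        SELECT substr(i.issue_date, 1, 4) AS issue_year,
               COUNT(*) AS n,
               b.latin_name
        FROM botanic AS b
        JOIN issued AS i ON i.patent_id = b.patent_id
        GROUP BY substr(i.issue_date, 1, 4), b.latin_name
        ORDER BY n DESC, issue_year, b.latin_name
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    con.close()
    return result


# Made-up ids and dates for illustration.
botanic = [("PP1", "Rosa hybrida"), ("PP2", "Rosa hybrida"), ("PP3", "Prunus persica")]
issued = [("PP1", "2008-01-01"), ("PP2", "2008-05-06"), ("PP3", "2008-07-08")]
for year, count, latin_name in top_names_by_year(botanic, issued):
    print(year, count, latin_name)
```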


So let me know if you think of something more interesting to do with the bulk data that is available. Or you could download it yourself to see how much fun it is!

Query Tool

The nice thing about patentsview is that they also offer a query tool for custom extracts. It produces a CSV file you can download and open in Excel. A query can get you plant patent data like the issue date and title, so you don't have to download their 1 GB bulk patent file of 6.3 million patents when you just want data for the 18,336 plant patents covered in the botanic file. On the advanced search screen at https://datatool.patentsview.org/query/, set the select boxes as shown and click "+ Add to Search" for each of the three conditions (there is an implied boolean AND between them).

Then click Submit Search (bottom of the page). Next, select the columns you'd like returned, click Preview Query (bottom of the page), fill in an email address and prove you are not a robot. If you do all of that they'll send you a link that will download your CSV file. How fun is that? The one caveat, which doesn't apply here, is that your result set needs to be 1 GB or smaller, as that seems to be the extent of their helpfulness. Any larger and you'd have to resort to dealing with the bulk data files or asking for a database dump, as explained in the email you'll receive.

The peds (formerly pairbulk) API offers a similar query tool and download, but it does not let you specify which columns you'd like. You get a lot of columns whether you want them or not. It also offers only JSON or XML as download formats. There is no CSV option. How unfun is that?
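
If you do end up with the bulk patent file instead, you don't have to load all of it to get at the plant patents. Below is a sketch that streams the file one row at a time and keeps only ids starting with PP. The function name is made up, and the column name "number" is a guess on my part, so check the data dictionary for the real one.

```python
import csv


def plant_patent_rows(lines, id_column="number", prefix="PP"):
    """Yield rows (as dicts keyed by header) whose id starts with the prefix."""
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        if row[id_column].startswith(prefix):
            yield row


# Tiny stand-in for the bulk file; the second row is a placeholder.
sample = [
    "number\ttitle\n",
    "PP20696\tWITHDRAWN\n",
    "10000000\tPlaceholder utility patent\n",
]
for row in plant_patent_rows(sample):
    print(row["number"], row["title"])
```

For the real thing you would pass an open file instead of the list, something like open("patents.tsv", newline="", encoding="utf-8"), and the rows would be read lazily as you loop.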

If you have an interest in plant patents I have more information about them here on my main site.