Bulk API Data and Query Tools
API providers often make bulk data available. The folks at patentsview.org have
this page that lists what they offer. They've even provided a
data dictionary (spreadsheet) which explains exactly what's in each file. Their work is covered by the
Creative Commons Attribution 4.0 License. Apparently all I have to do is credit the source, provide a link to the license and state whether I've made changes. Oh, and I'm not to suggest that the licensor endorses me or my use of their data.
My Android app and other pages on this site use an online version of location.tsv. I made two changes. The first was to remove the last three columns, as I had no need for them. The second was to remove the 793 rows that had non-ASCII (multi-byte UTF-8) characters in them, because requests to the location endpoint failed when those cities and states were used. See
utf8_locations.htm. I've also begun working on a new and improved locations table as explained
here.
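The two location.tsv changes described above could be sketched in Python roughly like this. It is only an illustration, not the script actually used: the function name and the sample rows are made up, and it assumes the file is plain tab-separated text with no quoting. It checks for non-ASCII characters after the trailing columns are dropped, which is also an assumption.

```python
def clean_location_lines(lines, drop_last=3):
    """Return tab-separated lines with the last `drop_last` columns removed,
    skipping any row that still contains a non-ASCII character."""
    kept = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if drop_last:
            fields = fields[:-drop_last]
        # str.isascii() is False for any character outside 7-bit ASCII
        if all(field.isascii() for field in fields):
            kept.append("\t".join(fields))
    return kept


if __name__ == "__main__":
    # Hypothetical rows; the real column names are in the data dictionary.
    sample = [
        "id\tcity\tstate\tx\ty\tz\n",
        "1\tBoston\tMA\tx\ty\tz\n",
        "2\tZ\u00fcrich\tZH\tx\ty\tz\n",
    ]
    for row in clean_location_lines(sample):
        print(row)
```

Working on plain `split("\t")` rather than a CSV parser avoids surprises from stray quote characters in city names.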
On my
Government Organization page I use an online version of government_organization.tsv with several changes. I moved the name column to the end (after level_three) and display it as an empty string if it matches
the level_one, level_two or level_three entry for that row. I also removed the single binary character a0 in id 25.
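As a rough sketch of that column shuffle in Python: the function below moves name to the end and blanks it when it repeats one of the level entries. The column name organization_id and the sample values are assumptions for illustration; check the data dictionary for the real headers.

```python
LEVELS = ("level_one", "level_two", "level_three")


def reshape_org_row(row):
    """Given one row as a dict, return [id, level_one, level_two, level_three, name],
    with name replaced by "" when it duplicates any level entry."""
    levels = [row[key] for key in LEVELS]
    shown_name = "" if row["name"] in levels else row["name"]
    # "organization_id" is an assumed header for the id column
    return [row["organization_id"], *levels, shown_name]


if __name__ == "__main__":
    # Made-up example row, not taken from the actual file
    example = {
        "organization_id": "1",
        "name": "Air Force",
        "level_one": "Department of Defense",
        "level_two": "Air Force",
        "level_three": "",
    }
    print(reshape_org_row(example))
```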
On my
withdrawn patent pages I used patentsview's
patents.tsv to figure out which ones are also in
the uspto's
withdrawn patents file.
On this page I am accessing an online version of botanic.tsv with two changes. The first was to drop the id column (first column in the spreadsheet). It's a string of 36 characters that I didn't see a need for. The second change was to delete the three rows of reissued data.
uuid | patent_id | latin_name | variety |
---|---|---|---|
9qskd3l0758uuilhx3w55c7cf | RE46030 | Rubus idaeus | Advabertwee |
cmrlj47ihepfnqzpriuh7hd43 | RE46041 | Rubus idaeus | Advabereen |
jscvs1slb2sjau3u8zl2ayj08 | RE46031 | Rubus idaeus | Advaberimar |
Which leaves us with 18,333 rows of data from PP33,801 issued 2021-12-28 through . Once the data is loaded in a database table we can do fun things like figure out which latin names are used most often.
patentsview.org botanic data

count | latin name |
---|---|
940 | Rosa hybrida |
464 | Chrysanthemum×morifolium |
295 | Calibrachoa sp. |
275 | Prunus persica |
269 | Rosa hybrid |
258 | Hydrangea macrophylla |
256 | Phalaenopsis hybrid |
247 | Impatiens hawkeri |
246 | Petunia×hybrida |
203 | Pelargonium×hortorum |
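For anyone who wants to try this at home, here is a minimal sketch using Python's built-in sqlite3 module. It assumes rows of (uuid, patent_id, latin_name, variety) as in botanic.tsv, drops the first column and the reissue (RE) rows as described earlier, and then counts Latin names. The table name, function names and sample uuids are made up, and this is not necessarily how the counts above were produced.

```python
import sqlite3


def drop_id_and_reissues(rows):
    """Drop the leading uuid column and any row whose patent_id starts with RE."""
    return [tuple(row[1:]) for row in rows if not row[1].startswith("RE")]


def top_latin_names(rows, limit=10):
    """Load (patent_id, latin_name, variety) rows into an in-memory table
    and return (count, latin_name) pairs, most frequent first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE botanic (patent_id TEXT, latin_name TEXT, variety TEXT)"
    )
    conn.executemany("INSERT INTO botanic VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT COUNT(*) AS n, latin_name FROM botanic "
        "GROUP BY latin_name ORDER BY n DESC, latin_name LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    raw = [
        ("uuid-1", "RE46030", "Rubus idaeus", "Advabertwee"),
        ("uuid-2", "PP20696", "Stenotaphrum secundatum", "Polaris"),
    ]
    print(top_latin_names(drop_id_and_reissues(raw)))
```

Note that grouping on the raw string treats spellings like "Rosa hybrida" and "Rosa hybrid" as different names, just as in the table above.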
Or check for gaps in the data
Data Gaps

gap_starts_at | gap_ends_at |
---|---|
PP-46040 | PP-46032 |
PP-46029 | PP15459 |
PP16339 | PP16339 |
PP22005 | PP22005 |
PP25724 | PP25724 |
PP25791 | PP25791 |
PP25810 | PP25810 |
PP25911 | PP25911 |
PP27441 | PP27441 |
PP29995 | PP29995 |
PP31408 | PP31408 |
PP31892 | PP31893 |
PP32102 | PP32102 |
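A gap check along these lines can be sketched in a few lines of Python. This is one possible approach and not necessarily the query that produced the table above: it assumes plant patent ids look like PP followed by digits with no commas, ignores anything else, and reports each run of missing numbers as a (start, end) pair. The function name is made up.

```python
def find_gaps(patent_ids):
    """Return (gap_starts_at, gap_ends_at) pairs for missing PP numbers
    between the smallest and largest id present."""
    numbers = sorted(
        {int(pid[2:]) for pid in patent_ids
         if pid.startswith("PP") and pid[2:].isdigit()}
    )
    gaps = []
    for prev, nxt in zip(numbers, numbers[1:]):
        if nxt - prev > 1:
            # everything strictly between prev and nxt is missing
            gaps.append((f"PP{prev + 1}", f"PP{nxt - 1}"))
    return gaps


if __name__ == "__main__":
    ids = ["PP31890", "PP31891", "PP31894", "PP31895", "PP31897"]
    print(find_gaps(ids))
```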
I checked and the missing numbers correspond to withdrawn plant patents. Interestingly, there was another withdrawn plant patent in this time frame yet there is a row in the spreadsheet for it. PP20696 is in the patentsview database so the uspto and patentsview are not totally in sync.
This page has more information on the withdrawn patents patentsview returns.
Odd Row

patent_no | title | issue_date | latin_name | variety |
---|---|---|---|---|
PP20696 | WITHDRAWN | 2010-02-02 | Stenotaphrum secundatum | Polaris |
Separately, from the uspto, I have the plant patent issue dates and other fields. I could do some sort of mash up to show, say, the most popular latin names by year of issue.
year | count | latin name |
---|---|---|
2008 | 109 | Rosa hybrida |
2006 | 95 | Rosa hybrida |
2005 | 76 | Rosa hybrida |
2007 | 70 | Chrysanthemum×morifolium |
2016 | 64 | Rosa hybrida |
2013 | 63 | Rosa hybrida |
2021 | 62 | Rosa hybrida |
2006 | 60 | Chrysanthemum morifolium |
2012 | 60 | Chrysanthemum×morifolium |
2010 | 56 | Rosa hybrida |
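The mash up itself is a simple join on patent number. Here is a hedged Python sketch, assuming the botanic rows are (patent_id, latin_name, variety) tuples and the issue dates are available as a dict mapping patent_id to a YYYY-MM-DD string. Both shapes, and the function name, are assumptions for illustration.

```python
from collections import Counter


def names_by_year(botanic_rows, issue_dates, limit=10):
    """Count (year, latin_name) pairs and return (year, count, latin_name)
    tuples, highest count first. Patents with no known issue date are skipped."""
    counts = Counter()
    for patent_id, latin_name, _variety in botanic_rows:
        issued = issue_dates.get(patent_id)
        if issued is None:
            continue
        counts[(issued[:4], latin_name)] += 1  # first four characters are the year
    return [(year, n, name) for (year, name), n in counts.most_common(limit)]


if __name__ == "__main__":
    rows = [("PP20696", "Stenotaphrum secundatum", "Polaris")]
    dates = {"PP20696": "2010-02-02"}
    print(names_by_year(rows, dates))
```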
So let me know if you think of something more interesting to do with the bulk data that is available.
Or you could download it yourself to see how much fun it is!
Query Tool
The nice thing with patentsview is that they also offer a query tool for custom extracts, which produces a csv file you can download and open in Excel. A query can get you plant patent data like the issue date and title without downloading their 1 GB bulk patent file of 6.3 million patents when you just want data for the 18,336 plant patents covered in the botanic file. On the advanced search screen at
https://datatool.patentsview.org/query/ set the select boxes as shown and click "+ Add to Search" for each of the three conditions (with an implied boolean AND between them).
Then click Submit Search (bottom of the page). Then select which columns you'd like returned,
click Preview Query (bottom of the page), fill in an email address and prove you are not a robot. Once you've done all of that they'll send you a link to download your csv file. How fun is that? One caveat, which doesn't apply here, is that your result set needs to be 1 GB or smaller, as that seems to be the extent of their helpfulness. Any larger and you'd have to resort to the bulk data files or ask for a database dump, as explained in the email you'll receive.
The peds (formerly pairbulk) API offers a similar query tool and download, but it does not let you specify which columns you'd like. You get a lot of columns whether you want them or not. The only download formats are JSON and XML. There is no csv option. How unfun is that?
If you have an interest in plant patents I have more information about them
here on my main site.