Patentsview API Bugs
My expectation is that the patentsview data would match the data in the uspto's
ppubs database, but I'm seeing these
major differences, listed in no particular order.
- Nearly 8,000 withdrawn patents are returned by the api. I raised this as an issue but it was closed without being fixed. See #18 below
- Plant patents do not have cpcs where appropriate. It was another issue that was closed without being fixed.
The problem is that the bulk Cooperative Patent Classification file
produced by the uspto only contains assignments for utility patents.
See numbers 3, 15, 19, 28, 33 and 41 below.
- There are a lot of problems with locations due to the underlying data. There needs to be a disambiguation effort like there was for inventors. See below.
- There are two problems with uspc classifications. Background: the uspto has been using its own classification system for at least
a hundred years. In the 2010s most of the world's patent offices agreed to start using a new Cooperative Patent Classification system.
As such, the uspto stopped assigning uspcs to utility patents issued after May 2015, and now exclusively assigns cpcs. The uspto does continue to assign
uspcs to design patents,
plant patents and reissued plant or design patents. This is an important distinction: the uspto still assigns uspcs to non utility patents.
- The first major problem is that the patentsview team mistakenly thinks that all patent types stopped receiving uspc assignments in 2015.
- The second major problem is that the uspto bulk uspc file the patentsview api uses has not been updated in two years and this
appears to be intentional on the uspto's part
(as if the uspto itself doesn't understand the important distinction).
This page says the file will stop being produced in January 2020 and the
file that is available is from 2018.
The last plant patent in mcfpat.zip is PP29260 and D816289 is the last design patent, each issued April 24, 2018.
Combining these two problems and problem 2 means that plant and design patents issued after March 2015 will not be returned by the
api's cpc_subsections or uspc_mainclasses endpoints.
In other words, you can do queries that should return these non utility patents but they will not be included in the returned results.
I consider this to be a critical flaw that needs action by the uspto and patentsview team to correct. I could work around the api returning
withdrawn patents etc. but I cannot work around data that is not returned by the patentsview api.
This page shows the null uspc
classifications coming back from the api for the most recently issued plant patents. Also provided are links that go to the uspto's site showing the uspc
classifications the api should be returning.
Here's a page showing recently issued design patent data returned by the api, which do not have uspcs.
- There are 305 missing patents. They aren't in the bulk xml files provided by the uspto. See 43 and 44 below.
- You cannot query for the same field using an "and" as you can in just about any api. Ex. you won't get results if you search for patents with an
inventor_last_name of smith and inventor_last_name of jones. Spoiler alert, they exist according to ppubs! See samefield.htm
and samefield2.htm
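The same-field query from the last item can be written with the query language's `_and` operator. Here is a minimal sketch that only builds the `q` parameter and makes no request. The helper name `same_field_and` is made up for illustration.

```python
import json


def same_field_and(field, values):
    """Build a q parameter requiring every value to match the same field."""
    # One {field: value} clause per value, all wrapped in a single _and.
    return json.dumps({"_and": [{field: value} for value in values]})


# The query that ppubs can answer but the api returns nothing for.
q = same_field_and("inventor_last_name", ["smith", "jones"])
print(q)
```

Per the item above, sending this `q` to the patents endpoint returns no results, even though ppubs shows such patents exist.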
Numbers 1 and 4A could be fixed by the patentsview team. They should not load the withdrawn patents they encounter in the grant xml files (ones subsequently withdrawn after being issued) into their
database. I don't know of any other system that returns data for withdrawn patents as they do.
There is an outdated bulk file available of uspcs so there is no reason the api should not return them.
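Until the withdrawn patents are removed from the database, they can be filtered out on the client side. This is a minimal sketch, assuming you have obtained a list of withdrawn patent numbers separately and that each returned patent carries a `patent_number` key. The helper name and the sample numbers are made up.

```python
def drop_withdrawn(patents, withdrawn_numbers):
    """Return only the patents whose number is not in the withdrawn list."""
    # Compare as strings so "1000002" and 1000002 are treated the same.
    withdrawn = {str(number) for number in withdrawn_numbers}
    return [p for p in patents if str(p["patent_number"]) not in withdrawn]


# Made-up sample data, not real api output.
patents = [
    {"patent_number": "1000001"},
    {"patent_number": "1000002"},
    {"patent_number": "1000003"},
]
withdrawn_numbers = ["1000002"]

kept = drop_withdrawn(patents, withdrawn_numbers)
print([p["patent_number"] for p in kept])
```

This only works around the first problem. It does nothing for data the api never returns.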
Numbers 2, 3, and 5 are problems with the underlying data the uspto makes available to anyone, including the patentsview team. The underlying files would need to be fixed
by the uspto before these problems could be fixed in the patentsview database. The locations problems (number 3) also exist in the ppubs database.
More information about these problems is
here.
Specific examples of these problems follow.
- Null states (the state field is not populated) are being returned by the locations
endpoint.
- Country of US with states whose length is not two
- I found a bunch of plant patents on uspto.gov that have cpcs!
patentsview is not returning this data as shown on this page.
- The sort on patent number isn't quite right
for plant patents and is a problem for utility patents now that patent numbers over 10 million have been issued.
- Odd cities
- This one is on me (my mistake) but it exposes an opportunity to add
additional input validation. For the sort parameter I entered
["location_city"] instead of [{"location_city":"asc"}]. The web server
threw an Internal Server Error (return code 500) and did not add the ultra helpful X-Status-Reason header as it does when it rejects requests for being invalid (return code 400).
- Duplicate locations are back. There are also UTF-8 duplicate locations.
- I get an X-Status-Reason: The operation '_text_any' is not valid on 'location_city''. It works on patent_title as shown on the query language
page but not on location_city. Does it only work on indexed fields or something?
Is this a bug or just something that needs to be documented?
Oops, this is another one on me. _text_any is for fields of type full text while _contains is for fields of type string. The endpoint pages show the field type, which then determines what operators can be used. Now that I know that, the X-Status-Reason makes way more sense!
- I needed to double up the single quotes to find the city Hashmona'im
q={"location_city":"Hashmona''im"} With just one single quote no results were found as shown here.
- Update: Posts seem to be working now!
Originally I could not get the post endpoints to work in a browser.
It appears the problem is the Access-Control-Allow-Headers: * header. It needs to be
Access-Control-Allow-Headers: origin, content-type, accept
Here's a page on my web site trying to post to the patents endpoint. It does not
work in firefox or chrome but did work in an older version of opera I
happened to have.
- Similarly there needs to be an access-control-expose-headers: X-Status-Reason
so the ultra helpful X-Status-Reason will be displayed in swagger (this is still not working though posts are working. I opened an issue for this.)
- There are 134,620 patents (92 locations) where the city and state
are not filled in.
- I found what looks like a utility patent with a patent type of Plant
as shown here.
I'm not sure if it's wrong in the underlying uspto data or if something else is wrong.
- Here's the opposite case, plant patents with a
patent type of something other than Plant.
- Cooperative Patent Classifications (cpcs) and US Classifications (uspcs) don't come back on recent plant patents as shown on
this page.
- There are some wacky countries in the disambiguated location table
- There are also odd US states
- Patentsview returns nearly 8,000 patents that the uspto says are withdrawn.
- Update: this appears to be fixed! cpcs did not seem to be coming back on recently issued utility patents
- I'm not able to get data back from the location endpoint for cities with UTF-8 characters in them. Also see this page.
- There are some cities with very long and odd looking names
- Only the sequence number is being returned for pre-1976 references. This was brought up in the patentsview forum
- government_organization in the data dictionary has just an id column and two columns named level_two yet government_organization.tsv has 5 columns whose header is organization_id, name, level_one, level_two, level_three. It appears that name is a fourth level when it doesn't match level_one, level_two or level_three as shown here. Except for id 165 that is, where the name is Energy Efficiency and Renewable Energy and level_two is Office of Energy Efficiency and Renewable Energy. It's not just a typo in the spreadsheet, it is possible to get results where
q={"govint_org_name":"Energy Efficiency and Renewable Energy"}. And if it truly is a hierarchy, there is no previous level for ids 164, 161, 157, 172, 177 and 187. Also ids 157 and 177 contain stray three character strings xa0, id 25 contained the binary character x0, ids 178 and 165 are duplicates as are ids 186 and 18 and 19 and 25.
- The bulk download page says this about uspc_current: "Current USPC classification data for all patents up to May 2015". This is true for utility patents, since the uspto stopped assigning them uspc classifications, but plant and design patents continue to receive uspc classifications, as do reissues of either of those two types. The bulk file as well as the patentsview database should contain uspc classifications beyond May 2015.
- It looks like you can't _and in a single field as shown here
- There are a number of govint_org_id that do not have patents assigned to them.
- html markup and entities come back in the patent_title field on recently issued plant patents.
- The locations endpoint throws 400 (bad input) and 500 (technical difficulties) errors when querying for fields their documentation says you can retrieve. ex: invalid field specified: cpc_sequence See
https://github.com/PatentsView/PatentsView-API/issues/24, https://github.com/PatentsView/PatentsView-API/issues/29 and api documentation
- There are 122 locations where the country is not a two letter code.
- There is an inconsistency in the data on whether Puerto Rico is a US State or a country.
- There are US locations impossibly close to each other. Further support for the big problem about to be mentioned. It would probably take a disambiguation like effort to clean this up.
- There are almost 1400 locations whose city starts with a digit. 40 are in the US and 1,358 are non US
- cpcs don't come back on plant patents (mentioned above) or reissued utility and plant patents as they should. The root cause appears to be the bulk cpc file from the uspto, which only contains utility patents.
- Canadian provinces are not used consistently.
- There are locations where the number of patents promised does not equal the number of patents delivered.
- There is a location with a null country and there are other countries containing a double quote.
- The query tool says "States data is only available for the United States" The state field is populated for non US countries. You could query for Canadian provinces as just mentioned.
- There are problems with plant patents in the bulk data file government_interest.tsv. Some patent numbers have the form USPP026833 while others have a zero after the PP, e.g. PP05345. 74 plant patents come back from the api but there are 135 plant patents in government_interest.tsv. See gov_plants.htm. patent_govintorg.tsv has the same PP9 and USPP entries.
- There are overly escaped double quotes in assignee_organization as shown in these odd assignees
- There are also weird assignee_organizations
- Recently reissued patents don't seem to have cpcs at the uspto or in patentsview.
- cited_patent_category seems to always be null. The api throws a 500 error if you try sorting on it. ex: s=[{"cited_patent_category":"asc"}]
- There are 266 patents missing in the patentsview database because they are miscategorized by the uspto
- There are an additional 39 missing patents for a different reason.
- There are 419 location cities that start out "Late of" to indicate a deceased inventor.
- There are 59 locations that contain "hacek over".
- The International Patent Classifications returned by the api are the ones when the patent was issued, not the potentially updated ones that ppubs could return.
There are bulk files (with issues noted here) for cpc and uspc classifications but I do not think there is a bulk source of IPCs. See ipcs.htm
- A user raised a git issue saying there are over 200,000 errors in the latitudes and
longitudes assigned to inventor and assignee locations.
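The input validation suggested in the sort parameter item above could look something like the following. This is a minimal sketch and not how the patentsview server is actually implemented. The function name `validate_sort` is made up, and it assumes the only valid directions are "asc" and "desc".

```python
def validate_sort(sort):
    """Return an error message for a malformed sort value, or None if it looks valid."""
    if not isinstance(sort, list):
        return "sort must be a list"
    for item in sort:
        # Each entry must be an object like {"location_city": "asc"}.
        if not isinstance(item, dict) or len(item) != 1:
            return f"each sort entry must be a single-key object, got {item!r}"
        field, direction = next(iter(item.items()))
        if direction not in ("asc", "desc"):
            return f"direction for {field!r} must be 'asc' or 'desc'"
    return None


# My mistaken input, which should be rejected with a 400 rather than a 500.
print(validate_sort(["location_city"]))
# The form the api expects.
print(validate_sort([{"location_city": "asc"}]))
```

A check like this would let the server answer with a 400 and an X-Status-Reason header instead of a 500.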
Most of the location bugs are due to the underlying uspto data. It is, for example, possible to search for an inventor's city that contains 200:
"200$".INCI.c/200$ at ppubs.uspto.gov. You might think you are getting back all the patents for a particular location from patentsview but you wouldn't if the uspto data contains errors. This can be seen on
this page where the searches are by latitude and longitude. You'll most likely see odd locations nearby that have patents associated with them. Ex: patents where the city is Los Angeles International Airport. The patents associated with the odd location wouldn't be returned by the patentsview location endpoint for a query where the city is Los Angeles. To me this is a big problem that is not mentioned anywhere I could find. The location endpoint should at least have a disclaimer explaining that you may not get back all the patents you may expect. To compensate for this data problem I think there should be an endpoint or parameters on the location endpoint to search within a specific distance from a specific latitude and longitude. I'd then be able to retrieve patents within a mile of downtown Los Angeles regardless of the "city" (airport, county or labeled rock on the roadside) that the uspto has as the city.
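The distance search proposed above can be approximated on the client side once locations and their coordinates have been retrieved. This is a minimal sketch using the haversine formula. The function names and the `latitude`/`longitude` keys are made up for illustration, and the sample coordinates are approximate.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(locations, lat, lon, miles):
    """Keep the locations no farther than `miles` from (lat, lon)."""
    return [
        loc for loc in locations
        if haversine_miles(lat, lon, loc["latitude"], loc["longitude"]) <= miles
    ]


# Approximate coordinates, for illustration only.
locations = [
    {"city": "Los Angeles", "latitude": 34.0522, "longitude": -118.2437},
    {"city": "Los Angeles International Airport", "latitude": 33.9416, "longitude": -118.4085},
    {"city": "San Francisco", "latitude": 37.7749, "longitude": -122.4194},
]
nearby = within_radius(locations, 34.0522, -118.2437, 15)
print([loc["city"] for loc in nearby])
```

With a 15 mile radius the airport "city" is picked up along with Los Angeles itself, which a query on the city name alone would miss.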
A possible alternative would be to do some sort of consolidation of locations, changing the odd locations into standard ones.
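A first step toward such a consolidation would be flagging the odd locations. The sketch below turns several of the checks listed earlier into simple tests: country codes that are not two letters, US states that are not two characters, cities starting with a digit or "Late of", and cities containing "hacek over". The function name and the `city`/`state`/`country` keys are made up for illustration, and the sample rows are invented.

```python
def location_flags(loc):
    """Return a list of reasons a location row looks suspicious (empty if none)."""
    city = loc.get("city") or ""
    state = loc.get("state") or ""
    country = loc.get("country") or ""
    flags = []
    if not city:
        flags.append("missing city")
    if city[:1].isdigit():
        flags.append("city starts with a digit")
    if city.lower().startswith("late of"):
        flags.append("city starts with 'Late of'")
    if "hacek over" in city.lower():
        flags.append("city contains 'hacek over'")
    if not (len(country) == 2 and country.isalpha()):
        flags.append("country is not a two-letter code")
    if country == "US" and len(state) != 2:
        flags.append("US state is not two characters")
    return flags


# Invented rows, not real api output.
rows = [
    {"city": "Los Angeles", "state": "CA", "country": "US"},
    {"city": "Late of Boston", "state": "Mass", "country": "US"},
    {"city": "200 Main", "state": None, "country": "Canada"},
]
for row in rows:
    print(row["city"], location_flags(row))
```

Flagged rows would still need a person or a lookup table to decide which standard location each one belongs to.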