The “yearlong study”

The USPTO published a Federal Register notice entitled Setting and Adjusting Patent Fees during Fiscal Year 2020, dated August 3, 2020 (85 FR 46932).  The Federal Register notice said, in four places:

The USPTO conducted a yearlong study of the feasibility of processing text in PDF documents. The results showed that searchable text data is available in some PDFs, but the order and accuracy of the content could not be preserved.

On May 30, 2022, I wrote to Andrew Faile, the Acting Commissioner for Patents at the USPTO, asking for a copy of the “yearlong study”.  In response, on June 7, 2022, I received from him a document dated March 24, 2015, entitled Text2PTO Proof of Concept White Paper | Version 1.0.  You can view it and download it here.  By the way, I wish to publicly thank Acting Commissioner Faile for providing this document to me.  I think his having provided the document shows some openness and willingness on his part to engage with the practitioner community on this DOCX issue in 2022.  I invite the practitioner community to join me in hoping that we can continue a dialogue with Acting Commissioner Faile and his people, maybe eventually leading to a better initiative for text-based patent filing than what had been forced upon applicants and practitioners previously.

By way of background, during the last four years (from about 2018 to the present in 2022), the USPTO has been engaged in an initiative intended to force patent applicants to hand in their US patent applications in a format different from the formats they had previously used.  Applicants had previously handed in their US patent applications in a wide range of variants of PDF format.  The USPTO has been trying to force applicants to hand in their patent applications in one particular variant of DOCX format, namely the proprietary DOCX variant that Microsoft uses with Microsoft Word.

Given that the USPTO was going to try to force a behavior change from one mix of filing formats to some other single filing format, one might have thought that the “yearlong study” was a study of the available PDF variants to see whether one of the PDF variants would be a good candidate for this purpose.  And indeed it takes just a few mouse clicks to learn that there are particular PDF variants that are extremely good candidates for this purpose, including PDF/A Level A (accessible) and PDF/UA.

We will never know how it would have gone if four years ago the USPTO had selected PDF/A Level A (accessible) or PDF/UA format instead of Microsoft Word DOCX format as the format to try to force applicants to use in filing their US patent applications.  The USPTO’s choice of DOCX has been a disaster from the applicant and practitioner point of view, leading to what is essentially an adversarial relationship now between the USPTO and a significant portion of the practitioner community.  In contrast, I think if the USPTO had selected PDF/A Level A (accessible) or PDF/UA format as the format to try to force filers to use, things would have gone much more smoothly and productively.

Of course, what would have been smarter still would have been for the USPTO to actually engage the practitioner community in meaningful dialogue back when the USPTO was trying to arrive at a particular data format to try to force applicants to use.  That did not happen, but not for want of begging and pleading on the part of the practitioner community.  The begging and pleading fell on deaf ears.

Let’s turn now to the “yearlong study” and let’s see what actually got studied.  The study starts with some background information:

USPTO currently accepts patent applications through its Electronic Filing System (EFS). EFS accepts PDF documents during application submission. Applicants can use various tools to create PDFs for EFS submission. More than 45% of these submitted PDFs have text behind them. Due to the differences in the COTS or open source tools used by the applicant to generate the PDFs, the format and structure of the PDFs differ. Currently, USPTO relies on OCR to extract text from TIFF representations of these submitted PDFs.

I imagine that you, like me, wondered what “COTS” means.  It turns out that “COTS” is an acronym for the phrase “commercial off-the-shelf”.

The study was not, in fact, a study to figure out whether there is some PDF variant that would be a good candidate for forcing patent applicants to use.  The study was, instead, directed to figuring out whether some existing software package, available in 2014, would be able to extract text successfully from a sampling of some actual PDF files that the USPTO had received in filed patent applications at some earlier time.  The study says:

Prototype testing will be done on samples provided by the USPTO team.

The authors of the study were not, in fact, surveying the many PDF flavors available in 2014 to see whether one of the flavors might work well for receiving text-rich US patent applications.  The authors of the study were, in fact, surveying the already-existing software tools available in 2014 to see whether one or another of the tools, when thrown at the actual mix of PDF formats that were being received from filers in 2014, could consistently extract most or all of what the USPTO hoped to receive in the way of text from applicants.

Of course the conclusion reached in the study was that no already-existing software tool was able to do this impossible task.  The outcome was completely predictable.  Recall what the authors said in their background section:

Applicants can use various tools to create PDFs for EFS submission. … Due to the differences in the [commercial off-the-shelf] or open source tools used by the applicant to generate the PDFs, the format and structure of the PDFs differ.

There was in fact no reason to hold out hope in 2014 that any already-existing software tool would be able to attack such a variety of PDF formats with success.  I think it is very likely that the authors of the study knew this from the start.  But had the authors pointed this out to the USPTO, there would have been no need to waste money on such a study, and the authors would not have received the money.  So of course the study went forward and reached its predictable “no” answer.

On a first quick reading of the study itself, I do not see anything that suggests that the authors intentionally skewed the results or made mistakes in their work.  Saying this differently, based on a first quick reading of the study, my sense is that the authors may well have carried out the study competently, given the wrongheaded framing of the study by the USPTO.  The problem is that the study does not actually inquire into things that were worth studying. The problem is a failure on the part of the USPTO to commission a study that would actually provide meaningful inputs for the decisions set forth in the Federal Register notice.

The study was directed to a question that nobody wanted to know the answer to.  The question addressed by the study was:

    • Suppose we don’t ask applicants to do anything differently than what they were doing in 2014.  Suppose they keep using whatever random PDF-generation tools they have been using in the past, and suppose they continue to generate PDFs in which “the format and structure” are uncontrolled.
    • Is there some existing software tool that can extract from such PDFs everything or nearly everything that the USPTO wants?

Of course the answer was “no”.

But the real situation at the USPTO was not that the USPTO planned to “not ask applicants to do anything differently than what they were doing in 2014.”  The real situation was that the USPTO did plan to force applicants to make drastic changes from what they were doing in 2014.  The USPTO’s plan was to force applicants to stop handing in patent applications in a variety of random PDF formats, and instead to force applicants to start handing in patent applications in some single format.

The USPTO apparently failed to give even a moment of consideration to the question of whether there might be some particular flavor of PDF that might give the USPTO everything or nearly everything that it wanted.  It is clear that nothing about the “yearlong study” even touched upon this question.  As I say, this is not the fault of the authors of the study.  On a quick look, the authors of the study probably did get the correct answer to the question as posed by the USPTO.  It is just that the question posed by the USPTO was poorly selected.

And, of course, the study was not needed at all.  The authors very likely knew the answer before they started, and if the USPTO had troubled itself to engage the practitioner community on this question, we could have told the USPTO the answer for free.  Anybody with any experience in this area could surely have figured out that the answer to this question as posed by the USPTO was “no” or perhaps more accurately “of course not”.  Of course it was a fool’s errand to try to get everything the USPTO wants if the starting point was some selection of patent applications in random and uncontrolled PDF formats.

Now we can return to what we can now realize is the extremely disingenuous two-sentence statement that appeared four times in that Federal Register notice:

The USPTO conducted a yearlong study of the feasibility of processing text in PDF documents. The results showed that searchable text data is available in some PDFs, but the order and accuracy of the content could not be preserved.

How might one edit these two sentences so that they would be honest and clear rather than disingenuous?  Here is one way to do it:

The USPTO conducted a yearlong study of the feasibility of taking, as a starting point, the wide range of random PDF formats that filers were filing in 2014, and using then-available software tools to try to extract from them all or nearly all of the character-based information that the USPTO hoped to obtain.  The results showed that given the wide range of random PDF formats that filers were filing, this was not a realistic goal.  The USPTO did not, however, give a moment’s thought to the possibility that there might be one or more particular PDF formats which, if filers could be convinced to use them, might give the USPTO all or nearly all of the character-based information that the USPTO hoped to obtain.

Let’s return to one of the telltale starting conditions mentioned in the “yearlong study”.  The authors carefully point out:

Prototype testing [was] done on samples provided by the USPTO team.

Some readers will doubtless wonder whether the USPTO team might have already settled upon Microsoft Word DOCX format, and were perhaps spending this money on the “yearlong study” simply to provide support after the fact for a decision that had already been made.  If so, then such readers would wonder whether the USPTO team intentionally eliminated from the sample set all of the applications that had been filed in PDF/A Level A (accessible) and PDF/UA format, so as to reduce the risk that the “yearlong study” might cast some doubt upon the wisdom of the decision previously made.

My best guess is that the USPTO team did not consciously or intentionally tamper with the sample set to omit patent applications filed in PDF/A Level A (accessible) and PDF/UA format.  It is Hanlon’s razor that prompts me to make this guess.  Hanlon’s razor (Wikipedia article) says “never attribute to malice that which is adequately explained by stupidity.”  To accomplish a skewed outcome of the yearlong study by tampering with the sample set, the USPTO team would have needed to learn enough about the field of available PDF flavors to know which ones work well for text extraction and which ones do not.  My best guess is that nobody on the USPTO team ever troubled himself or herself in 2014 to learn about things like PDF/A Level A (accessible) or PDF/UA formats.  If anybody on the USPTO team ever did trouble himself or herself to learn about this, I suspect it only happened now in 2022, as a result of lurking on the EFS-Web listserv, where these formats have been discussed in detail.

What’s more, if in 2014 the USPTO team had troubled itself to learn about things like PDF/A Level A (accessible) or PDF/UA formats, I think there is a fair chance the USPTO team would have followed a completely different path, perhaps conducting feasibility work with these particular formats in cooperation with the practitioner community.