My guesses as to what caused yesterday’s massive system crash

I have some guesses as to what went wrong to bring down all of those USPTO systems yesterday evening. 

Yesterday at about 7PM Eastern Time, at least the following USPTO systems crashed:

  • Patentcenter
  • EFS-Web
  • EFS-Web contingency
  • Private PAIR
  • TEAS

Here is what the USPTO posted roughly an hour and a half after the crash, at about 8:30 PM Eastern Time:

Due to unplanned maintenance, our external IT systems are currently offline today (December 15, 2021). We expect our systems to be offline for the remainder of the day, and that the systems will be up and running tomorrow. 

The only information that I have about this massive crash is what anyone can see as an external observer.  The first thing to notice is how similar this posting is to the things that Captain Kirk used to hear from Scotty in Star Trek (the original series).  In about every third episode, somewhere around the third quarter, some bad thing would happen to the Starship Enterprise.  The warp core had overheated and shut down, or whatever.  Kirk would call down to the engine room asking Scotty how long it was going to take to get the warp core working again.

Now I don’t know about you, but whenever I am trying to fix a thing that is broken that somebody else is relying upon, I almost never have even a ghost of a chance at predicting when I will get the thing working again.  But somehow Scotty was invariably able to say just how many minutes it would take to get that pesky warp core working again.  (Oh, and whatever number of minutes it was, that number of minutes always seemed to fit just right into the remaining number of minutes of screen time to the end of the episode.  Who would have thought?)  Whenever I was sitting in front of the television watching such an episode of Star Trek TOS, and Scotty said something like that, I shook my nine-year-old head in disbelief.

So anyway, look at the USPTO posting.  Just like Scotty, the anonymous author of that posting actually stuck his or her neck out and predicted what day things would be “up and running” again.

Pretty weird, huh?

So what I am guessing is that maybe some single-point-of-failure server had failed.  Yes, they sort of had a backup set of hardware, and yes, they sort of had a disk image of what would be needed to configure that hardware as the replacement server.  If so, it would have been fairly predictable how long it was going to take to load the disk image into the machine, then load the underlying database backups into the machine, and so on.

Which machine?

Well, I have a guess about that too.  Early this morning, during the massive outage, emails arrived letting me know about the outgoing patent correspondence from the USPTO.  And emails arrived from the MyUSPTO patent docket and trademark docket monitors.  All of this gives a strong impression that lots of the bread-and-butter internal systems at the USPTO were running all through yesterday evening and all through the night more or less as normal.  

And the USPTO’s posting itself reinforces this impression.  It characterizes the crash as limited to “our external IT systems”.   But not really all of their external IT systems.  For example, the main web page seemed to be working as usual.  And email is an external IT system, yet it was working just fine and indeed was apparently working so well that email became USPTO’s communications path of choice for contingency e-filing of patent papers and trademark papers.

And one more thing.  When the systems all sort of abruptly became available again, between about 7AM Eastern Time today and 8AM Eastern Time today, the abrupt restoration of function happened at about the same time for all of the systems.  So for example it was not that maybe TEAS came up first, and later PAIR, and still later Patentcenter.  No.  It was all pretty much simultaneous.  And when the systems did rather abruptly become available again, all of the systems seemed to be working just as usual.  Not limping along.

No, it was not all of their external IT systems that crashed.  As best I was able to see, the massive crash was limited to only the external IT systems for which customers need to log in and prove who they are.  

So my best guess is, what went wrong last night around 7PM Eastern Time was a crash in the server that has to be working for external USPTO customers to be able to log in.  Of course what I do not know is the cause of that crash.  It must not have been as simple as somebody tripping over a power cord and knocking a power plug out of a receptacle.  That would simply have required plugging it back in and following some procedure for rebooting something.

One could imagine a simple hardware failure in that server.  It’s not out of the question.

Of course what rings in all of our ears are the things that we heard (well, six thousand of us heard) at the webinar two days ago, hosted by the Commissioner for Trademarks, telling us about the new measures that the Commissioner has put in place that will require all trademark filers to prove exactly who they are.  Every trademark filer will need to go through an extremely invasive identity verification procedure sometime between January 8, 2022 and April 8, 2022, and any filer who fails to do that will find that starting on April 8, they will not be able to use the TEAS system any more.

Maybe there is something about USPTO’s activities to implement this new identity verification procedure that somehow took a wrong turn yesterday evening?  Somebody was trying to load the latest version of firmware into the server that handles external customer logins, and accidentally reformatted the hard drive?

If you let your imagination run, you could conceive of somebody deciding they don’t like this new identity verification thing and launching a denial-of-service attack on the server that makes it work.  But I am not putting my money on that one, for the simple reason that if somebody had crashed that server by external means, the USPTO people would have been unable to “do a Scotty” and predict so confidently just how many hours it was going to take to get the warp core working again.

No, my money is on some hapless beltway bandit government contractor person accidentally reformatting a hard drive at exactly the wrong time, something like that.

Oh, and if indeed that is what it was, let me say that there is a correct time to schedule risky stuff like that.  The correct time for it is 12:30 AM or so.  Not 7PM.  That way, if it goes terribly wrong, it does not make trouble for people who need to e-file a patent application or trademark application and who need to get a same-day filing date.
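To put rough numbers on that, here is a minimal sketch in Python.  The function name `filing_time_left` is my own invention for illustration, the times are plain Eastern Time wall-clock values, and the twelve-hour outage is an assumption loosely based on the 7PM-to-7AM timeline described above.  It simply asks how much of the calendar day on which the risky work started is still left for e-filing once the systems come back.

```python
from datetime import datetime, timedelta


def filing_time_left(start: datetime, outage: timedelta) -> timedelta:
    """Return how much of the start date remains once systems are restored.

    `start` is a naive datetime read as Eastern Time wall-clock time
    (an assumption made to keep the sketch free of time zone data).
    """
    restored = start + outage
    # Midnight at the end of the calendar day on which the work began.
    end_of_day = datetime.combine(start.date() + timedelta(days=1),
                                  datetime.min.time())
    # If restoration lands on a later date, nothing is left of the start date.
    return max(end_of_day - restored, timedelta(0))


# Hypothetical worst case: the work goes wrong and recovery takes 12 hours.
outage = timedelta(hours=12)

# Started at 7 PM: the rest of that filing day is gone.
print(filing_time_left(datetime(2021, 12, 15, 19, 0), outage))  # 0:00:00

# Started at 12:30 AM: systems are back by 12:30 PM the same day.
print(filing_time_left(datetime(2021, 12, 15, 0, 30), outage))  # 11:30:00
```

With the same twelve-hour recovery, the 7PM start leaves no time at all for a same-day filing, while the 12:30 AM start leaves eleven and a half hours.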

Oh, and I wonder if I need to mention yet again the need to move the “contingency” patent e-filing server to some location that is geographically distant from the main server, and that is connected to electric power in a different way, and that is connected to the Internet in a different way.  Oh, and while we are at it, I have not heard a peep from the USPTO about any provision ever having been made for a “contingency” trademark e-filing server.  

8 Replies to “My guesses as to what caused yesterday’s massive system crash”

  1. My opinion is that they were scrambling to patch the log4j vulnerability on their external-facing systems. Not all their systems were written in Java, so not all had to be taken down.

    1. You beat me to it, Chris. My first guess was that they were trying to patch log4j without taking their systems down and overlooked some dependency that caused the crash.

      We’re just lucky that most of the USPTO internal systems are coded in Fortran, so they couldn’t have relied on a Java logging library like log4j. Good thing, too, because otherwise somebody would have to run over to Crystal City and retrieve the punch cards from a basement storage cage.

  2. Regarding your comment on the identity verification requirement for trademark filers, if my understanding is correct, those of us who are already verified by the USPTO for our MyUSPTO accounts on the patent side will not need to undergo the rigorous proof-of-identity requirement. Again, if I recall correctly, there will be a process by which we can use that account to satisfy the new trademark identity verification requirement.

    Then again, they mentioned towards the end of the Q&A that, based on the same questions coming in over and over, they must not have explained it very well. We’ll see come January!

  3. I also thought it might be a cold-turkey patch evening to install fixes for Log4j.

    However, if Carl is right that this outage arose from something that went awry when some prospective work was done for authentication, or anything similar, it’s reasonable to ask why the PTO does not have a test environment in which that stuff can be extensively tested, so that if the upgrade or installation is to fail, that happens in the test environment and not on a live system that external users rely on.

    I’d especially commend Carl’s point that “there is a correct time to schedule risky stuff […] 12:30 AM or so.” And that goes for just about any work that may affect an entire system or application.

  4. As a side note for interested readers, Carl did not mention that as a kid when he watched Star Trek (original series) it was on a _color_ television that he built from a Heathkit.

    1. Ruth, what a priceless insight. May I suggest that you write a blog here on ALP with more of these. Perhaps a Top Ten listing of sorts.
