Earlier today, T-Bot Rewritten experienced downtime because I overlooked one thing: I assumed too early that a single layer of protection would be enough to keep the service available.
How it started
I updated all packages on the application server without giving it much thought. A big mistake, as I later found out.
I use the forever npm package to automatically restart my web services if they fail. Because of this, it did not occur to me for a second that forever does not cover system reboots.
After the updates had finished, the system rebooted to complete the patching of system files. Then suddenly, all users were greeted by an error when trying to log in to T-Bot Rewritten.
Ouch! That is not supposed to happen. The web service responsible for authentication and authorization was offline. This is the first time in T-Bot history that we have experienced downtime beyond my own control.
A collection of many inconveniences
For security reasons, all applications running under the icseon.com domain can only be administered from an internal network. Because of this, we were unable to do anything about the downtime until I was at home.
My own security essentially prevented me from doing the work required to bring my services back online.
After this incident, I have decided to allow myself to access the application servers using SSH keys backed by FIDO2 hardware keys. I realized that locking myself out is not the brightest idea, even if it means great security.
Fixing the problem
After nine hours of unavoidable downtime, I got home. At first the fix seemed easy: just create a systemd service that starts automatically at boot. Then I came to a realization:
You cannot invoke npm run through the ExecStart parameter that services give you to perform operations on system boot. Tricky. For a while, I sat in my chair, dumbfounded that something as simple as running a command on system boot could be this hard.
After some investigation, I tried invoking npm by its absolute path, /usr/bin/npm, followed by the production environment variable amongst other flags. This worked! To be sure, we rebooted the server several times, and sure enough, the service came back up every time. The problem was solved, and we were starting to see traffic come in again. A success!
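For readers who want to try the same approach, here is a minimal sketch of what such a unit file could look like. This is not my exact configuration: the unit name, the user, the working directory, the "start" script name and the NODE_ENV variable are all assumptions made for illustration. The only part taken from the fix described above is calling npm by its absolute path.

```ini
# /etc/systemd/system/tbot-auth.service
# Hypothetical file name and location, chosen for this example only.

[Unit]
Description=Example Node.js web service started through npm
# Wait until the network is up before starting the service.
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
# Assumed service account and application directory.
User=tbot
WorkingDirectory=/opt/tbot-auth
# Assumed way of passing the production setting to the application.
Environment=NODE_ENV=production
# Absolute path to npm; "start" is an assumed script in package.json.
ExecStart=/usr/bin/npm run start
# Restart the process if it exits with an error.
Restart=on-failure
RestartSec=5

[Install]
# Start the service during a normal multi-user boot.
WantedBy=multi-user.target
```

With a file like this in place, running systemctl daemon-reload followed by systemctl enable --now tbot-auth.service registers the service so that it starts on every boot. Check the location of npm on your own server with command -v npm first, since it may not live in /usr/bin.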
What I learned
I must always ensure that my services also start automatically after a reboot of the server they run on. I did not cover this case, and that was a mistake on my end. I acknowledge that this was a learning moment and am glad that it happened now, rather than later.
I hope you learnt something from my mistake!