I had two gigs this week that both involved emergency analysis and repair on web applications that had been around for a few years without much trouble, only to blow up right around the time that a critical demo or business evaluation was about to take place.
Of course, in both cases, the app didn’t blow up just then. After looking into things, it became clear that both had been broken for quite some time. One slipped off the radar because it was effectively on hiatus: no current customers, no active marketing, and, while there was an active signup page, no new “random” signups (though as it turned out, there may have been attempts that failed because the app was broken). The other had a set of buffer mechanisms that kept a failure from being noticed for some time.
As it happens, both broke after a server move, even though both were supposedly tested afterwards. The same thing can happen after an OS update, a change to a third-party API (here’s something else that came up: Salesforce passwords can expire!), or any other change not directly linked to the application source code. These kinds of errors generally won’t hit the home page, which is where a lot of supposed uptime monitors check the “health” of the site, usually with an HTTP request that triggers an alert if it doesn’t come back with a 200 (OK) response code.
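To make that blind spot concrete, here is a minimal sketch of the kind of home-page check described above. The function name and the injectable `opener` parameter are my own illustration, not taken from any particular monitoring product.

```python
from urllib.request import urlopen


def home_page_is_up(url, timeout=10, opener=urlopen):
    """Return True if the URL answers with HTTP 200.

    This is all a basic uptime monitor looks at. `opener` is a
    hypothetical hook so the function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are both subclasses of OSError, so this
        # covers DNS failures, refused connections, and 4xx/5xx responses.
        return False
```

A check like this passes as long as the front page renders, even if signup, login, or a third-party integration behind it has been failing for months.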
For your consideration, a simple list of things to do to prevent “old age” outages:
Written test plans. Yes, they’re a pain to write. Yes, they’re a pain to update. Know what? They’re the easiest things to delegate. You can give a proper test plan to just about anyone, inside your group or remote, and they can run through it. If your app has a decent amount of end-user activity, a run-through should only be necessary before an update (to the app or the server environment), since you’ll hear about outages quickly enough (though I’ve been amazed at how quiet customers of some B2B applications can be). But if your app is parked, or only part of it gets used in day-to-day operation, schedule a test run regularly.
Automated testing. As I mentioned above, most uptime monitoring sucks. You can find out that your home page is loading (and I’ve seen apps with a broken home page that still return OK), and you can monitor disk space, CPU load, etc., but why not take advantage of modern UI testing tools like Selenium and have something hitting specific parts of your production website on a regular schedule? Note that this isn’t a substitute for actual documentation. If the whole team leaves, the new gang is going to have a hard time figuring out what the app is supposed to do in “normal” use, and sadly, might not be proactive in finding out (I was paid well this week simply because someone put an app in production without the first clue of how it worked or what resources it required).
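As a rough illustration of checking specific parts of a site rather than just the front door, here is a minimal sketch using only the Python standard library. It works at the HTTP level and is a lighter stand-in for a real browser-driven tool like Selenium, not a replacement for one. The paths, marker strings, and function names are all hypothetical; swap in pages that exercise your own app’s database, third-party APIs, and other dependencies.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# Hypothetical pages and the text each one should contain when healthy.
CHECKS = [
    ("/signup", "Create your account"),
    ("/login", "Password"),
    ("/reports/summary", "Total"),
]


def fetch(url, timeout=10):
    """Return (status, body text); network failures become (None, message)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except HTTPError as exc:
        # The server answered, but with an error status such as 500.
        return exc.code, ""
    except OSError as exc:
        # DNS failure, refused connection, timeout, and so on.
        return None, str(exc)


def run_checks(base_url, checks=CHECKS, fetcher=fetch):
    """Return a list of failure descriptions; an empty list means all passed."""
    failures = []
    for path, expected_text in checks:
        status, body = fetcher(base_url + path)
        if status != 200:
            failures.append(f"{path}: status {status}")
        elif expected_text not in body:
            # A 200 response with the wrong content still counts as broken.
            failures.append(f"{path}: missing {expected_text!r}")
    return failures
```

Something like this could be run from cron or any scheduler, with an alert sent whenever the returned list is not empty. The content check matters as much as the status code, since a broken page can still return OK.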
And really, that’s pretty much all it takes to get started. I’ve worked with a number of clients over the years, and most don’t even do this much. There’s obviously more that can be done, but if clients took even baby steps beyond “home page loads,” I’d be happy for now.