In their excellent book Rework: Change the Way You Work Forever, Jason Fried and David Heinemeier Hansson offer this advice:
“When something goes wrong, someone is going to tell the story. You’ll be better off if it’s you. Otherwise, you create an opportunity for rumours, hearsay, and false information to spread.
“When something bad happens, tell your customers (even if they never noticed it in the first place). Don’t think you can just sweep it under the rug. You can’t hide anymore. These days, someone else will call you on it if you don’t do it yourself. They’ll post it online and everyone will know. There are no more secrets.” (op. cit., p.231)
Something went wrong
Today something went wrong: what turned out to be a configuration error on the server rendered the University website unusable for about 2-3 hours this morning.
While the homepage could be viewed, nothing below it was available. Google Chrome returned an HTTP Error 500 Internal Server Error; other browsers weren’t quite so helpful, and there was nothing in the Apache logs that gave us any clue about what was going on.
Working to fix it
While the IT Systems team (in the office upstairs) worked to figure out what was going wrong, the Web team (in our offices downstairs) did what we could to alert the IT Helpdesk and then respond to queries from users who were phoning or tweeting to ask where the website had gone.
Knowing that it’s best to be up-front and honest, one of the first things I did was send a tweet from the @stawebteam account:
It took a couple of hours for the root cause of the problem to be unearthed and for all the pages to become available again.
The problem was two-fold: the server was running out of disk space, which in turn exposed a problem with how PHP had been configured. While redundant temp files and directories were being deleted from the machine to free up some much-needed space, an empty directory that PHP was relying on was deleted along with them. The result was that every page on the website that relied on PHP (probably about 99% of the site) stopped working altogether.
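With hindsight, a simple check could have flagged both halves of the problem before the site fell over. Here is a minimal sketch of that kind of health check; the session directory path and the 90% threshold are assumptions for illustration (PHP’s session.save_path is commonly /var/lib/php/sessions or /tmp, but the real path depends on the server’s php.ini):

```shell
#!/bin/sh
# Minimal health-check sketch, not our actual monitoring.
# SESSION_DIR is an assumption: use the path from your php.ini.
SESSION_DIR="${SESSION_DIR:-/tmp}"

# Warn when the filesystem holding the web root is nearly full.
USED=$(df -P / | awk 'NR==2 { gsub("%", ""); print $5 }')
if [ "$USED" -ge 90 ]; then
    echo "WARNING: root filesystem at ${USED}% capacity"
fi

# Fail loudly if the directory PHP depends on has been deleted.
if [ ! -d "$SESSION_DIR" ]; then
    echo "ERROR: session directory $SESSION_DIR is missing"
    exit 1
fi

echo "OK: $SESSION_DIR present, disk at ${USED}% used"
```

Run from cron every few minutes, a script like this would have turned a silent misconfiguration into an alert long before users saw an Error 500.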
What went right
I think we were right to tweet about the problem as soon as we knew. It was embarrassing to do so but it was right to be up-front about it and reassure those users who were following our tweets that the issue was being looked into. (It’s not as though we could have kept the issue hidden anyhow: you just had to try to visit the website!)
It was right to contact the IT Helpdesk early on and update them about the current status of the problem, as they would be receiving queries by phone, email and in person; when I called they reported that they had already received a few.
The communication between the Web team and the IT Systems team was very good and clear. I thought we complemented one another well and worked together to get the issue resolved as quickly as we could. There was no blame or ill-feeling one way or the other: only collaboration, which was great.
Once the site was brought back up we even bought the guy who did the bulk of the job a fudge doughnut to say thanks.
Lessons to be learned
I think today’s incident raised a number of issues: about procedures for making changes and updates to live servers (and communicating that those changes are about to be made), about server configuration, and about the need for redundancy (that is, providing a second web server to which we can switch should there be a problem with the first).
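For illustration, that kind of redundancy might look something like the following Apache mod_proxy_balancer fragment on a front-end proxy, where the second member is a hot standby that only takes traffic when the first stops responding (the hostnames are hypothetical, not our actual servers):

```apache
# Sketch only: front-end proxy failing over to a standby web server.
<Proxy "balancer://webfarm">
    BalancerMember "http://web1.example.ac.uk" retry=30
    BalancerMember "http://web2.example.ac.uk" status=+H
</Proxy>
ProxyPass        "/" "balancer://webfarm/"
ProxyPassReverse "/" "balancer://webfarm/"
```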
Needless to say, we’ve already created a project to look into and resolve these issues as soon as we can.