Own your bad news

Oops! Sorry!

Photo by MJ Photography (iStock. by Getty Images)

What do you do when something goes wrong in an obvious and public way?

Should you tell your users that you’re dealing with it, even if at that point you have no idea what’s causing the problem? Or do you just buckle down and try to fix it as soon as possible, hoping that the outage has affected as few people as possible and that nobody really noticed anyway?

Personally, I’m a fan of getting the bad news out there and owning it. It’s something that was confirmed to me recently, once at work and again while trying to use the internet at home.

Case study #1: me

I started writing this post last week, but I ran out of time and saved it to draft to complete later. This week I was upgrading a few WordPress plugins and themes only to discover that the latest update (v.1.2.1) to the Blackbird theme broke four site homepages.

What to do: fix it in the hope that nobody noticed, or publicize my foul-up and let everyone know there was an issue? It was at that point I remembered this post and took my own advice, tweeting about the problem.

Twenty minutes later I tweeted the good news that I’d fixed the issue and a link to the solution in case other people encountered the same thing.

This latter post was retweeted once, by our own IT Service Desk.

Case study #2: BT

A couple of Saturdays ago, while I was sitting at my desk at home, one of my five-year-old boys popped his head into the study and said “something has happened to the big telly”. I hurried downstairs and was rather relieved to discover that he simply meant they couldn’t access Netflix. Panic over.

After a bit of investigation I discovered that while there was a broadband connection, there were certain sites that our smart TV couldn’t access, including Netflix and the manufacturer’s site (to check for a firmware upgrade). I retired to my study to investigate further.

On my PC I discovered that there too I was only able to access certain sites; my mobile phone was doing the same, plus my wife’s tablet. Google was fine, Amazon wasn’t; I could access Facebook, but not Twitter. This was when I began to suspect the issue was to do with DNS, the system that maps domain names to the servers they are stored on (e.g. www.example.com maps to the IP address 93.184.216.119).

I tried to access a few sites directly by their IP address, and I could. Both Amazon and Twitter were available that way, so I simply changed my DNS settings to use Google Public DNS (8.8.8.8 and 8.8.4.4) and suddenly everything returned.

My internet service provider (ISP) is British Telecom (BT), so I checked their business service status updates page: there was nothing about this. I contacted @BTCare, the support channel that BT runs on Twitter. I heard nothing back.

I searched Twitter and there “BT DNS problem” was all but trending on the social network. So I put a query out on Facebook too and very quickly friends in Devon and East Anglia also reported that they couldn’t connect via BT (but could via their mobile network).

It was beginning to look like a UK-wide incident, but still there was no word from BT about it on any of the channels that I checked. It wasn’t until a couple of hours later that I checked the BBC website and read “BT apologises for broadband problem”.

@BTCare eventually posted an apology too

The feedback was interesting and not altogether surprising: many users said that they had spent an hour or more trying to diagnose the incident, and others simply wished that BT had let people know sooner that there even was a problem.

37Signals

In their first two books 37Signals (now Basecamp) advise that you own your bad news:

When something goes wrong, someone is going to tell the story. You’ll be better off if it’s you. Otherwise, you create an opportunity for rumours, hearsay, and false information to spread. (Rework, Vermilion, 2010, p.231)

They advocate openness, honesty and transparency. Don’t keep secrets, they say, or hide behind spin. “Customers are usually happy to give you a little bit of breathing room as long as they know you’re being honest with them.”

I think that’s pretty solid advice, to be honest.

Drop the assumptions, re-frame the question

More than two years ago, on 7 November 2011, I created a new card on our Trello board called “Fix memos on internal homepages”.

It then sat on the board for the next 24 months, and was not touched, apart from being shuffled around the board a bit, from one list to another: from ‘backlog’ to ‘this week’ to ‘backlog’ to ‘known issues’ to ‘backlog’ to ‘bugs’ to ‘known issues’ to ‘backlog’ before it was finally scheduled to be done four weeks ago, three weeks ago, two weeks ago, one week ago… this week.

In the end it only took about 30 minutes to fix!

The issue

The issue in question was this: if a user visited the current staff, current students’ or current postgraduates’ homepage (and a few other school websites) under a secure connection (https) then they were not able to read the memos or events. Instead, they were served an error message, e.g.

An error has occurred fetching the memos.

The problem

The problem wasn’t affecting too many people, either. We would maybe receive one complaint about it every six months, so there was no urgency. But we considered it a significant enough issue to keep it on the board. For two years.

The reason that nobody fixed the issue was, I suspect, two-fold:

  1. Fear
  2. Prejudice

Fear

We feared that this problem would be very complicated and time-consuming to fix. Our fear led to inaction.

It sounds like it should be a complicated thing.

It sounds like we should need to know exactly what happens at both the server and browser levels when content is served using Hypertext Transfer Protocol Secure (HTTPS) compared with a non-secure communication.

It sounds like we may need to delve into server configurations or look up obscure chapters in books about PHP security to figure it out.

We didn’t.

Prejudice

Each time we discussed that card (as we shuffled it around the board trying to settle on the best list on which to ignore it) we spoke vaguely about what we suspected we needed to do, rather than simply stating the problem we were experiencing.

We always spoke about this incident with a particular solution in mind. And that solution was one that had emerged from our fearful speculation rather than simply stating the facts about the problem.

“Oh, that’s the problem where we have to force the pages to be loaded as http rather than https, isn’t it?”

And we would all agree.

“We’re going to have to do that using either .htaccess or PHP, aren’t we?” someone would continue.

And we’d all agree.

But we were wrong.

Drop the assumptions, re-frame the question

At the start of the week, I volunteered to look at this card.

I already had our pre-conceived idea at the ready and so I immediately spent a couple of hours reading up on forcing https pages to reload as http using .htaccess files.

“Boy this looks complicated!” I thought. “Surely there’s an easier way.”

And it was only at that point that I suddenly realised what I was doing. Or rather what I wasn’t doing. I wasn’t considering the actual problem itself; I was leaping straight to a solution.

The problem wasn’t that these pages weren’t being forced to load under http; it was that the memos and events wouldn’t display if the pages were loaded under https. That’s a whole different issue.

I rewrote the ticket in Trello to reflect this: “Allow memos and events to be viewed under https”.

I then wrote myself an Agile-style story:

  • As a member of staff…
  • When visiting the staff homepage under an https connection…
  • I would like to view the events and memos, rather than an error message.

And it was at this point that I suddenly realised what the root cause of the issue was: the browsers were not pulling in the events and memos because the URLs for them were hard-coded in the PHP and JavaScript files to be served under http.

So I changed these to protocol-relative URLs.

Now, rather than referencing http://www.st-andrews.ac.uk/my-file.php, the code references //www.st-andrews.ac.uk/my-file.php so that the browser uses whichever protocol the page itself was loaded under (see section 4.2 of RFC 3986 Uniform Resource Identifier (URI): Generic Syntax for details).
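To see why that works, here is a minimal sketch in Python (not the PHP or JavaScript we actually changed) that resolves a reference against the page containing it, much as a browser would. The /staff/ page address is a made-up placeholder; only the my-file.php URL comes from the example above.

```python
from urllib.parse import urljoin

# The protocol-relative reference from the text above.
RELATIVE = "//www.st-andrews.ac.uk/my-file.php"
# The old hard-coded form.
HARD_CODED = "http://www.st-andrews.ac.uk/my-file.php"


def resolve(page_url: str, reference: str) -> str:
    """Resolve a reference against the URL of the page that contains it."""
    return urljoin(page_url, reference)


# Hypothetical page addresses, used purely for illustration.
secure_page = "https://www.st-andrews.ac.uk/staff/"
plain_page = "http://www.st-andrews.ac.uk/staff/"

# The protocol-relative form inherits the scheme of the page.
secure_result = resolve(secure_page, RELATIVE)
plain_result = resolve(plain_page, RELATIVE)

# The hard-coded form stays on http even when the page is on https.
hard_coded_result = resolve(secure_page, HARD_CODED)

print(secure_result)
print(plain_result)
print(hard_coded_result)
```

The first two results take their scheme from the page, while the hard-coded reference stays on http regardless, which is the mismatch that was behind our error message.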

And like I said, it took me about 30 minutes to update all six pages that were affected.

The lesson for next time: drop the assumptions, re-frame the question. Perhaps the answer is simpler than you first thought.

Why did the Website go offline today?

Missing jigsaw piece

In Jason Fried and David Heinemeier Hansson’s excellent book Rework: Change the way you work forever they offer this advice:

When something goes wrong, someone is going to tell the story. You’ll be better off if it’s you. Otherwise, you create an opportunity for rumours, hearsay, and false information to spread.

“When something bad happens, tell your customers (even if they never noticed it in the first place). Don’t think you can just sweep it under the rug. You can’t hide anymore. These days, someone else will call you on it if you don’t do it yourself. They’ll post it online and everyone will know. There are no more secrets.” (op. cit., p.231)

Something went wrong

Today something went wrong: what turned out to be a configuration error on the server rendered the University website unusable for about 2-3 hours this morning.

While the homepage could be viewed, nothing below it was available. Google Chrome returned an HTTP Error 500 Internal Server Error; other browsers weren’t quite so helpful, and there was nothing in the Apache logs which gave us any clue about what was going on.

Working to fix it

While the IT Systems team (in the office upstairs) worked at trying to figure out what was going wrong, the Web team (in our offices downstairs) did the best we could to alert IT Helpdesk and then respond to queries from users who were calling or tweeting to ask where the website had gone.

Knowing that it’s best to be up-front and honest, one of the first things I did was send a tweet from the @stawebteam account:

Currently investigating where the University website has gone... ^gjms

It took a couple of hours for the root cause of the problem to be unearthed and for all the pages to become available again.

The problem was two-fold: the server was running out of disk space, which in turn unearthed a problem with how PHP had been configured. While redundant temp files and directories were being deleted from the machine to free up some much-needed space, a directory (an empty directory) that PHP thought it was relying on was deleted. The result was that every page on the website that relied on PHP (which is probably about 99% of the site) stopped working altogether.
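Both halves of that failure are the sort of thing a small pre-flight check could flag before and after a clean-up. What follows is only a sketch of the idea in Python, not something we ran: the directory list, the volume and the free-space threshold are all hypothetical placeholders to be supplied by whoever knows the server.

```python
import os
import shutil


def preflight(required_dirs, volume="/", min_free_bytes=1024**3):
    """Return a list of problems found; an empty list means all checks passed.

    required_dirs  -- directories the application expects to exist (placeholders)
    volume         -- any path on the disk whose free space should be checked
    min_free_bytes -- hypothetical threshold below which we want a warning
    """
    problems = []

    # A directory that software relies on may be empty and still be essential.
    for path in required_dirs:
        if not os.path.isdir(path):
            problems.append(f"missing directory: {path}")

    # Warn while there is still time to free space in a considered way.
    free = shutil.disk_usage(volume).free
    if free < min_free_bytes:
        problems.append(f"low disk space on {volume}: {free} bytes free")

    return problems
```

Running something like `preflight(["/path/to/required/dir"])` before and after tidying up temp files would report a directory that had gone missing, even an empty one.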

What went right

I think we were right to tweet about the problem as soon as we knew. It was embarrassing to do so but it was right to be up-front about it and reassure those users who were following our tweets that the issue was being looked into. (It’s not as though we could have kept the issue hidden anyhow: you just had to try to visit the website!)

It was right to contact the IT Helpdesk early on and update them about the current status of the problem, as they would be receiving queries by phone, email and in person; when I called they reported that they had already received a few.

The communication between the Web team and the IT Services systems team was very good and clear. I thought we complemented one another well and worked together to get the issue resolved as quickly as we could. There was no blame or ill-feeling one way or the other: only collaboration, which was great.

Once the site was brought back up we even bought the guy who did the bulk of the job a fudge doughnut to say thanks.

Lessons to be learned

I think today’s incident raised a number of issues: about procedures for making changes and updates to live servers (and communicating that these changes are about to be made), about server configurations, and about the need for redundancy (that is, providing a second web server to which we can switch should there be an issue with the first).

Needless to say we’ve already created a project to look into and resolve these issues as soon as we can.