Own your bad news

Oops! Sorry!

Photo by MJ Photography (iStock by Getty Images)

What do you do when something goes wrong, and goes wrong in an obvious and public way?

Should you tell your users that you’re dealing with it, even if at that point you have no idea what’s causing the problem? Or do you just buckle down and try to fix it as soon as possible, hoping that the outage has affected as few people as possible and that nobody really noticed anyway?

Personally, I’m a fan of getting the bad news out there and owning it. It’s something that was confirmed to me recently, once at work and again while trying to use the internet at home.

Case study #1: me

I started writing this post last week, but I ran out of time and saved it to draft to complete later. This week I was upgrading a few WordPress plugins and themes only to discover that the latest update (v.1.2.1) to the Blackbird theme broke four site homepages.

What to do: fix it quietly in the hope that nobody had noticed, or publicize my foul-up and let everyone know there was an issue? It was at that point that I remembered this post and took my own advice: I tweeted that there was a problem.

Twenty minutes later I tweeted the good news that I’d fixed the issue and a link to the solution in case other people encountered the same thing.

This latter post was retweeted once, by our own IT Service Desk.

Case study #2: BT

A couple of Saturdays ago, while I was sitting at my desk at home, one of my five-year-old boys popped his head into the study and said “something has happened to the big telly”. I hurried downstairs and was rather relieved to discover that he simply meant they couldn’t access Netflix. Panic over.

After a bit of investigation I discovered that while there was a broadband connection, there were certain sites that our smart TV couldn’t access, including Netflix and the manufacturer’s site (to check for a firmware upgrade). I retired to my study to investigate further.

On my PC I discovered that there too I was only able to access certain sites; my mobile phone and my wife’s tablet behaved the same way. Google was fine, Amazon wasn’t; I could access Facebook, but not Twitter. This was when I began to suspect the issue was to do with DNS, the system that maps domain names to the IP addresses of the servers that host them (e.g. www.example.com maps to the IP address 93.184.216.119).

I tried to access a few sites directly by their IP address, and I could. Both Amazon and Twitter were available that way, so I simply changed my DNS settings to use Google Public DNS (8.8.8.8 and 8.8.4.4) and suddenly everything returned.
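That diagnosis (names fail to resolve while the servers themselves are reachable) can be sketched in a few lines of Python. This is only an illustration, assuming lookups go through whatever resolver the operating system is configured to use; the function name check_names and the resolve parameter are hypothetical, introduced here so the lookup can be swapped out.

```python
import socket


def check_names(hostnames, resolve=socket.gethostbyname):
    """Try to resolve each hostname; map it to its IP address, or None on failure.

    `resolve` defaults to the system resolver, so the result reflects
    whichever DNS servers the machine is currently configured to use.
    """
    results = {}
    for name in hostnames:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror (lookup failure) is a subclass of OSError
            results[name] = None
    return results
```

Calling check_names(["www.google.com", "www.amazon.co.uk"]) before and after changing the DNS settings would show which names the current resolver can and cannot look up. A name that maps to None while its IP address still responds points at DNS rather than at the connection itself.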

My internet service provider (ISP) is British Telecom (BT), so I checked their business service status updates page: there was nothing about this. I contacted @BTCare, the support channel that BT runs on Twitter. I heard nothing back.

I searched Twitter, where “BT DNS problem” was all but trending. So I put a query out on Facebook too, and very quickly friends in Devon and East Anglia also reported that they couldn’t connect via BT (but could via their mobile network).

It was beginning to look like a UK-wide incident, but still there was no word from BT about it on any of the channels that I checked. It wasn’t until a couple of hours later that I checked the BBC website and read “BT apologises for broadband problem”.

@BTCare eventually posted an apology too

The feedback was interesting and not altogether surprising, with many users saying that they had spent an hour or more trying to diagnose the incident, and others simply saying that they wished that BT had let people know sooner that there even was a problem.

37Signals

In their first two books 37Signals (now Basecamp) advise that you own your bad news:

When something goes wrong, someone is going to tell the story. You’ll be better off if it’s you. Otherwise, you create an opportunity for rumours, hearsay, and false information to spread. (Rework, Vermilion, 2010, p.231)

They advocate openness, honesty and transparency. Don’t keep secrets, they say, or hide behind spin. “Customers are usually happy to give you a little bit of breathing room as long as they know you’re being honest with them.”

I think that’s pretty solid advice, to be honest.

Drop the assumptions, re-frame the question

More than two years ago, on 7 November 2011, I created a new card on our Trello board called “Fix memos on internal homepages”.

It then sat on the board for the next 24 months, and was not touched, apart from being shuffled around the board a bit, from one list to another: from ‘backlog’ to ‘this week’ to ‘backlog’ to ‘known issues’ to ‘backlog’ to ‘bugs’ to ‘known issues’ to ‘backlog’ before it was finally scheduled to be done four weeks ago, three weeks ago, two weeks ago, one week ago… this week.

In the end it only took about 30 minutes to fix!

The issue

The issue in question was this: if a user visited the current staff, current students or current postgraduates homepage (or a few other school websites) over a secure connection (https), then they were not able to read the memos or events. Instead, they were served an error message, e.g.

An error has occurred fetching the memos.

The problem

The problem wasn’t affecting too many people. We would maybe receive one complaint about it every six months, so there was no urgency. But we considered it significant enough to keep it on the board. For two years.

The reason that nobody fixed the issue was, I suspect, two-fold:

  1. Fear
  2. Prejudice

Fear

We feared that this problem would be very complicated and time-consuming to fix. Our fear led to inaction.

It sounds like it should be a complicated thing.

It sounds like we should need to know exactly what happens at both the server and browser levels when content is served using Hypertext Transfer Protocol Secure (HTTPS) compared with a non-secure communication.

It sounds like we may need to delve into server configurations or look up obscure chapters in books about PHP security to figure it out.

We didn’t.

Prejudice

Each time we discussed that card (as we shuffled it around the board trying to settle on the best list on which to ignore it) we spoke vaguely about what we suspected we needed to do, rather than simply stating the problem we were experiencing.

We always spoke about this incident with a particular solution in mind. And that solution was one that had emerged from our fearful speculation rather than simply stating the facts about the problem.

“Oh, that’s the problem where we have to force the pages to be loaded as http rather than https, isn’t it?”

And we would all agree.

“We’re going to have to do that using either .htaccess or PHP, aren’t we?” someone would continue.

And we’d all agree.

But we were wrong.

Drop the assumptions, re-frame the question

At the start of the week, I volunteered to look at this card.

I already had our pre-conceived idea at the ready and so I immediately spent a couple of hours reading up on forcing https pages to reload as http using .htaccess files.

“Boy this looks complicated!” I thought. “Surely there’s an easier way.”

And it was only at that point that I suddenly realised what I was doing. Or rather what I wasn’t doing. I wasn’t considering the actual problem itself; I was leaping straight to a solution.

The problem wasn’t that these pages weren’t being forced to load under http; it was that the memos and events wouldn’t display if the pages were loaded under https. That’s a whole different issue.

I rewrote the ticket in Trello to reflect this: “Allow memos and events to be viewed under https”.

I then wrote myself an Agile-style story:

  • As a member of staff…
  • When visiting the staff homepage under an https connection…
  • I would like to view the events and memos, rather than an error message.

And it was at this point that I suddenly realised what the root cause of the issue was: the browsers were not pulling in the events and memos because the URLs used to fetch them were hard-coded in the PHP and JavaScript files to use http.

So I changed these to protocol relative URLs.

Now, rather than referencing http://www.st-andrews.ac.uk/my-file.php, the code references //www.st-andrews.ac.uk/my-file.php, so that the browser uses whichever protocol the page itself was loaded under (see section 4.2 of RFC 3986, Uniform Resource Identifier (URI): Generic Syntax, for details).
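To show how such a reference resolves, here is a minimal sketch in Python. The real fix was made by hand in our PHP and JavaScript files; the helper name to_protocol_relative and the two page addresses below are assumptions made up for the example. It strips the scheme from a hard-coded URL, then uses the standard library’s urljoin to resolve the result against a page loaded under each protocol.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def to_protocol_relative(url: str) -> str:
    """Drop the scheme from an absolute http(s) URL, keeping host, path and query."""
    parts = urlsplit(url)
    if parts.scheme in ("http", "https") and parts.netloc:
        return urlunsplit(("", parts.netloc, parts.path, parts.query, parts.fragment))
    return url  # leave relative URLs and other schemes untouched


hard_coded = "http://www.st-andrews.ac.uk/my-file.php"
relative = to_protocol_relative(hard_coded)

# Hypothetical page addresses, used only to show how the reference resolves.
secure_page = "https://www.st-andrews.ac.uk/staff/"
plain_page = "http://www.st-andrews.ac.uk/staff/"

# A reference starting with "//" inherits the scheme of the page it appears on.
resolved_secure = urljoin(secure_page, relative)
resolved_plain = urljoin(plain_page, relative)
```

Here relative is //www.st-andrews.ac.uk/my-file.php, which resolves to an https address on the secure page and an http address on the plain one, so the same markup works under both.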

And like I said, it took me about 30 minutes to update all six pages that were affected.

The lesson for next time: drop the assumptions, re-frame the question. Perhaps the answer is simpler than you first thought.