How we approach solving a technical issue
Fieldmotion, like all software companies, spends a lot of time managing issues. It’s not always full steam ahead on feature development. It can be said that every new line of code is a new issue waiting to happen. In this article I’ll describe how we generally manage the inevitable problems that arise.
The key to solving any issue is to follow a few very simple rules:
- Don’t panic.
- Write down clearly what the issue is.
- Use your logs and logic to find the cause of the issue and correct it.
- Think about how you’re going to stop the issue from happening again.
There are libraries of books written about how to solve problems so this is all generic. If you want specifics and really in-depth detail, I recommend reading the book Site Reliability Engineering, written by a load of SREs at Google. It’s been weighing down my laptop case for at least a month as I digest it, and I may read it again a few times before the year is out.
Panic is the enemy of clear thinking. Only a few days ago, I had to manage an issue where there were at least four separate people in my room, all giving me their version and diagnosis of a sudden issue that had happened. I couldn’t get a clear story out of one before another would chime in with something else. In the end, I had to ask them all to please just tell me in simple terms what the issue was, and then get out of the room.
It is important when solving an issue to first calm everyone down so you can get the story cleanly and without embellishment.
By embellishment, I mean anything beyond the facts. Someone describing an issue should describe the symptoms clinically and without bias, honestly admit anything they did that could have led to it, and avoid stating a definite cause before the evidence has established one.
Don’t rely on memory. Too many times, I’ve heard people say “I didn’t do that”, or “That’s not what happened”, or “I’m sure I clicked that”. Human memory is a funny thing: memories are plastic, and can be molded by what you want to be true. You need to get the story calmly, clearly, and as factually as possible, without opinions or other biases.
Write the issue down in as few words as possible, and as precisely as possible. An example of a good description is “user account 123 has item 456 in it that the user says should not be there”. Your job now is clear: find out how item 456 was added to user account 123, explain the cause, and, if it turns out to be a bug, fix it and make sure it cannot happen again.
Data is cold and hard. It sits on your hard drives and only changes when the servers are told to do something with it. (Yes, there is such a thing as “bit rot” that can randomly change data, but that’s another story.) And if you’ve built your software well, those changes usually leave a trace in a log somewhere.
If you slowly and carefully trace through your logs to tease apart events leading up to the undesired state, you can usually find the cause of a problem and be certain (or as close as is possible) that you have it.
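As a rough illustration of that kind of tracing, here is a minimal Python sketch. It assumes the logs are stored one JSON object per line with the fields ts, actor, action, account and item. That format, the field names and the sample entries are all made up for the example, so adapt them to whatever your own logs look like.

```python
import json

def trace_events(log_lines, account_id, item_id):
    """Return log events touching the given account and item, oldest first."""
    events = []
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(event, dict):
            continue  # skip JSON values that are not log records
        if event.get("account") == account_id and event.get("item") == item_id:
            events.append(event)
    # ISO 8601 timestamps sort correctly as plain strings
    return sorted(events, key=lambda e: e.get("ts", ""))

# Made-up sample entries, deliberately out of order
sample_log = [
    '{"ts": "2019-03-01T10:02:11", "actor": "alice", "action": "add_item", "account": 123, "item": 456}',
    '{"ts": "2019-03-01T09:58:40", "actor": "bob", "action": "view_item", "account": 123, "item": 456}',
    '{"ts": "2019-03-01T10:05:00", "actor": "carol", "action": "add_item", "account": 999, "item": 456}',
    'not a json line',
]

for event in trace_events(sample_log, 123, 456):
    print(event["ts"], event["actor"], event["action"])
```

Reading the matching events in time order is usually enough to see who did what, and when, without relying on anyone’s memory.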
Logs are so vital to this job that if we ever come across a case where something happens and we can’t trace the events leading up to it, we make an educated guess as to what might have happened, and then put extra logs in place around the code in that area, so if it happens again, we will know more. This has sometimes let us solve very strange one-in-a-million problems where the events only happened maybe twice, and years apart. We might not solve rare issues the first time around, but the next time they happen, we’re ready.
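Putting extra logs around a suspect area can be as simple as recording the state before and after the change. The sketch below uses Python’s standard logging module. The function add_item_to_account, the logger name and the shape of the account data are hypothetical, chosen only to illustrate the idea.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")
logger = logging.getLogger("accounts")  # hypothetical logger name

def add_item_to_account(account, item_id, actor):
    """Add an item to an account, logging the state before and after."""
    logger.info(
        "add_item start account=%s item=%s actor=%s items_before=%s",
        account["id"], item_id, actor, sorted(account["items"]),
    )
    account["items"].add(item_id)
    logger.info(
        "add_item done account=%s item=%s items_after=%s",
        account["id"], item_id, sorted(account["items"]),
    )

# Hypothetical account data for the demonstration
account = {"id": 123, "items": {789}}
add_item_to_account(account, 456, actor="alice")
```

Logging who triggered the change and what the data looked like on both sides of it means that the next time the rare case occurs, the log shows which step went wrong.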
When correcting an issue, make sure you have rooted out its source completely before you start correcting the user data, as otherwise the same fault may simply damage the data again. Solve the issue first, and then correct the data. Sometimes you need to fix the data in the interim because it’s time-sensitive, but this means you will need to check and fix the data again once the real fix is in place.
Sometimes the data will turn out to be unrecoverable because of the nature of the issue. In that case, recover what you can from backups, fix the issue, inform the client as to what happened, and put safeguards in place so that it does not happen again. An example we had recently involved a sporadic bug in an external library that caused some file uploads to fail. We wrote a workaround, added checking code to ensure the upload had worked before reporting its completion (and asynchronous retries in case of a failure), recovered data, and informed the client.
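The “check before reporting completion” idea can be sketched roughly as below. This is not our actual workaround: it is a simplified, synchronous version (the real one retried asynchronously), and the upload and fetch_checksum callables, the SHA-256 comparison and the FlakyStore class are all assumptions made for the example.

```python
import hashlib
import time

def upload_with_verification(data, upload, fetch_checksum,
                             max_attempts=3, delay_seconds=0.0):
    """Upload data and only report success once the stored copy is verified.

    Returns the number of attempts that were needed.
    """
    expected = hashlib.sha256(data).hexdigest()
    for attempt in range(1, max_attempts + 1):
        try:
            upload(data)
            if fetch_checksum() == expected:
                return attempt  # verified, safe to report completion
        except OSError:
            pass  # treat I/O errors like a failed verification and retry
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"upload not verified after {max_attempts} attempts")

class FlakyStore:
    """Hypothetical storage backend that silently loses the first upload."""
    def __init__(self):
        self.calls = 0
        self.stored = b""

    def upload(self, data):
        self.calls += 1
        if self.calls > 1:
            self.stored = data

    def checksum(self):
        return hashlib.sha256(self.stored).hexdigest()

store = FlakyStore()
attempts = upload_with_verification(b"file contents", store.upload, store.checksum)
print(attempts)  # prints 2: the first upload was lost, the second was verified
```

The important part is that success is decided by checking what was actually stored, not by whether the upload call returned without complaint.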
When deciding how to make sure that an issue doesn’t happen again, it’s important to understand that if the issue was human-caused, the solution is never to punish the human. That doesn’t work. Instead, you need to work out how to make it harder for the human to make the same mistake, or automate the task if possible.
An example of how we make mistakes harder to make: our software lets you delete multiple jobs at the same time. The usual pattern for this kind of thing is that you tick some checkboxes next to the jobs you want to delete, click Delete Selected, and then click Yes on a confirmation box that asks “are you sure?”. This pattern is a mistake. People are so used to those popups that they will just click Yes without even thinking. We’ve diagnosed issues where people swore they hadn’t clicked the popup, until we showed them the logs proving that they had.
The solution in this case is actually to have the client type in a short sentence, like “I am sure”. This makes them think a little more about whether they actually are sure they want to do this. We rarely get this type of issue reported to us now, and no-one complains about the minor inconvenience of spending an extra two seconds typing.
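A typed confirmation like this takes very little code. The sketch below is only an illustration: the function names, the dictionary of jobs, and the decision to ignore letter case and surrounding spaces are assumptions for the example, not a description of our real implementation.

```python
CONFIRMATION_PHRASE = "I am sure"

def is_confirmed(typed_text, required=CONFIRMATION_PHRASE):
    """Return True only if the user typed the required phrase."""
    # Assumption: ignore case and surrounding whitespace to avoid frustration
    return typed_text.strip().casefold() == required.casefold()

def delete_selected(jobs, selected_ids, typed_text):
    """Return (remaining_jobs, deleted); nothing is deleted without the phrase."""
    if not is_confirmed(typed_text):
        return jobs, False
    remaining = {job_id: job for job_id, job in jobs.items()
                 if job_id not in selected_ids}
    return remaining, True

# Hypothetical jobs keyed by id
jobs = {1: "Boiler service", 2: "Meter install", 3: "Site survey"}

print(delete_selected(jobs, {1, 2}, "yes"))        # phrase not typed, nothing deleted
print(delete_selected(jobs, {1, 2}, "I am sure"))  # jobs 1 and 2 are removed
```

Because the check sits in the function that does the deleting, a reflexive click or a quick “yes” cannot remove anything.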
We once had a client report an issue, and then after we determined the issue was human-caused and had recovered from the error, the client asked us how we intended to punish the person who made the mistake. I was shocked at the question! Mistakes happen. You should not punish people for them. Instead, you make the mistake harder to repeat in the future.
Never punish a person for making a mistake. If you do, then you are making it harder for yourself in the future, as when another issue happens (and they WILL happen!), people will not own up to what they did, as they will be afraid of punishment.
Instead, if a specific person is found to be the cause of an issue, you need to find out how they came to make the decisions that led them to do it. For example, was the wording of something not clear enough? Was it too easy to do the wrong thing?
We’re all only human, and we improve over time.