Published On: Thu, Nov 30th, 2017

Improving Software Quality by Embracing DevOps Failure

It is good to fail, right? Whenever we conduct software development workshops, I pose this question, and guess what? The common answer, probably yours too, is no. We have a culture that glorifies success and condemns failure. According to Thomas Edison and many others, this culture misses the point. Edison is widely quoted as saying of his many unsuccessful attempts to make a light bulb that he had not failed, he had just found 10,000 ways that would not work. Therein lies the goodness of failure.

Failure is only bad, or even wrong, when we don’t learn from it. Unfortunately, many organizations still play corporate blame-apportionment games. In our training, we teach DevOps teams to encourage failure. Madness, you might say, but there is a system to it and plenty of supporting evidence.

Mainly, it is based on the fact that the cost of a system failure increases the further you progress through the development and deployment lifecycle. Recalling a product or issuing patches for a deployed system is very costly, both in actual expenditure and in brand reputation. A web search will reveal numerous software companies whose demise was caused by shipping faulty software. By comparison, the cost of an additional two-hour meeting during requirements gathering or system design is negligible.

When corporate culture is not conducive to reporting failure or potential failures, staff members tend to pass the buck in an attempt to make it someone else’s problem. In the end, system issues will be noticed and reported by the customer, which is not an ideal situation.

Tips for constructive DevOps failure

The first step towards constructive failure is the paradigm shift alluded to above. When systems fail, you need to shift from looking for someone to blame to looking for ways to improve. The goal should be to find out how the mistake happened, not who made it. People are prone to errors just as they are prone to genius. Replacing one person with another does not make the mistake any less likely; it only lowers morale and initiative. An investigation that reveals process errors, on the other hand, results in process improvements that lead to fewer incidents.

Tips on conducting constructive failures

  1. Postmortem every incident: It does not matter how small you consider the incident to be. The fact that it happened shows there is a problem, and resolving it now avoids future escalation.
  2. Publish and index every postmortem: Information is power, so limiting the circulation of a postmortem report reduces the power available to your team.
  3. Establish criteria for appointing bug and issue teams: Changing the team regularly brings in different perspectives, which increases the chances of novel and important findings.
  4. Regularly follow up on postmortem recommendations: Having reports is not enough. You need to ensure that their recommendations are acted on.
  5. Automate incident reporting: This makes it easier to spot trends and to ensure each incident is followed up on.

Conclusions

Complex systems are bound to fail. As such, failure should not be avoided. Instead, it should be accepted and taken as an opportunity to improve. Each failure should result in a postmortem with reports that can be an essential tool in building better systems and products.
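
To make the idea of published, indexed, and followed-up postmortems more concrete, here is a minimal Python sketch. It is only an illustration under our own assumptions: the `Postmortem` and `PostmortemIndex` names, the fields, and the in-memory storage are all hypothetical, and a real team would more likely build this on top of its existing issue tracker.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Postmortem:
    """One postmortem report. All field names are illustrative assumptions."""
    incident_id: str
    occurred_on: date
    summary: str
    tags: list[str]
    # Maps each recommendation to True once it has been followed up on.
    recommendations: dict[str, bool] = field(default_factory=dict)


class PostmortemIndex:
    """Hypothetical in-memory store that publishes and indexes reports by tag."""

    def __init__(self) -> None:
        self._reports: list[Postmortem] = []
        self._by_tag: dict[str, list[Postmortem]] = {}

    def publish(self, report: Postmortem) -> None:
        """Store the report and index it under each of its tags."""
        self._reports.append(report)
        for tag in report.tags:
            self._by_tag.setdefault(tag, []).append(report)

    def search(self, tag: str) -> list[Postmortem]:
        """Return every published report carrying the given tag."""
        return list(self._by_tag.get(tag, []))

    def trends(self) -> Counter:
        """Count incidents per tag so recurring problem areas stand out."""
        return Counter(tag for r in self._reports for tag in r.tags)

    def open_follow_ups(self) -> list[tuple[str, str]]:
        """Return (incident_id, recommendation) pairs not yet completed."""
        return [
            (r.incident_id, rec)
            for r in self._reports
            for rec, done in r.recommendations.items()
            if not done
        ]
```

With something along these lines, `trends()` would show which kinds of incident keep recurring, and `open_follow_ups()` would list the recommendations that still need attention.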


About the Author