This past week, headlines buzzed about how the AWS outage was caused by human error and what Amazon was doing to address it:
- Bloomberg: Amazon Says Employee Error Caused Tuesday’s Cloud Outage
- NPR: Amazon And The $150 Million Typo
- Verge: How a typo took down S3, the backbone of the internet
- WSJ: A Typo Caused Amazon’s Big Cloud Outage
- Recode: Amazon’s massive AWS outage was caused by human error
One important fact in Amazon's Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region caught my attention.
Amazon had not been periodically exercising this failure scenario (Disaster Recovery testing) to verify that S3 is resilient. By its own account, it had not completely restarted the affected subsystems in its larger regions for many years. If it had, it would have seen that recovery was not scaling as expected.
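To make the idea of a periodic failure drill concrete, here is a minimal sketch in Python. It is not how Amazon tests S3. The linear restart model, the `seconds_per_million` figure, and every name in it are assumptions invented for illustration. The point is only that a drill compares measured recovery time against an agreed budget and that the result changes as the system grows.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    """Outcome of one disaster recovery drill (hypothetical structure)."""
    scenario: str
    recovery_seconds: float
    budget_seconds: float

    @property
    def passed(self) -> bool:
        # The drill passes only if recovery fits inside the agreed budget.
        return self.recovery_seconds <= self.budget_seconds


def simulated_restart_seconds(indexed_objects: int,
                              seconds_per_million: float = 2.0) -> float:
    """Assumed model: restart time grows linearly with the index size.

    A real drill would restart the real subsystem and time it instead.
    """
    return indexed_objects / 1_000_000 * seconds_per_million


def run_drill(scenario: str, indexed_objects: int,
              budget_seconds: float) -> DrillResult:
    """Run one drill against the simulated system and record the result."""
    recovery = simulated_restart_seconds(indexed_objects)
    return DrillResult(scenario, recovery, budget_seconds)


if __name__ == "__main__":
    # Made-up sizes: the same drill, repeated as the system grows.
    for scenario, size in [("small region", 100_000_000),
                           ("large region", 1_000_000_000)]:
        result = run_drill(scenario, size, budget_seconds=300)
        status = "PASS" if result.passed else "FAIL"
        print(f"{scenario}: {result.recovery_seconds:.0f}s "
              f"(budget {result.budget_seconds:.0f}s) {status}")
```

A drill that passed when the system was small can fail once it has grown, which is why it has to be repeated on a schedule rather than run once.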
AWS should be publishing the results of such periodic failure testing, given the huge impact on SLAs. I am sure many tech-savvy companies using AWS have built their own safeguards to reduce the impact of such scenarios, but having to do so negates the reason for going to AWS in the first place. Amazon is not alone in its failure to test Disaster Recovery scenarios. Many enterprises don’t do this religiously either, because of the enormous costs involved. And some disasters simply cannot be replicated in a test. The fact is that software for mission-critical systems is not as cheap as everyone expects it to be. Unless the world wakes up to this truth, we will continue to muddle along.
Which brings me to the topic of root cause. With any recent disaster, whether a system failure at AWS or a major meltdown as in the case of Uber, we look at the headlines and debate how one should avoid such things. What is needed is root cause analysis (RCA) to fix the underlying problem, in addition to applying band-aids.
Yes, we need the band-aids, but we also need the discipline to spend energy and resources on finding the root causes and taking steps to fix them. In reality there is usually more than one root cause, and fixing one of them may have a far greater impact than fixing the others.
For those who want to know more about RCA, there are excellent resources on the web. I like the “5 Whys” for RCA, modified as needed. But no tool is effective without help from Subject Matter Experts (SMEs) who know how to apply it. Investing in RCA training has huge payoffs for any enterprise.