Human error is not the root cause
Posted by Michał ‘mina86’ Nazarewicz on 12th of January 2025
In 2023 UniSuper, an Australian retirement fund, decided to migrate part of its operations to Google Cloud. As part of the migration, the fund needed to create virtual machines with resource limits higher than what Google’s user interface allowed to set. To achieve this, UniSuper contacted Google support, and a Google engineer, having access to internal tools, was able to create the requested instances.
Fast forward to May 2024. UniSuper members lose access to their accounts. The fund blames Google. Some people are sceptical, but eventually UniSuper and Google Cloud publish a joint statement which points at ‘a misconfiguration during provisioning’ as the cause of the outage. Later, a postmortem of the incident sheds even more light on the events that transpired.
It turns out that back in 2023, the Google engineer used a command line tool to manually create the cloud instances according to UniSuper’s requirements. Among its various options, the tool had a switch setting the cloud instance’s term. The engineer omitted it, so the instance was created with a fixed term, which triggered its automatic deletion a year later.
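To make the failure mode concrete, here is a minimal sketch of how an omitted switch can silently pick a dangerous default. This is purely hypothetical; the tool, the ‘--term-days’ flag and its default are inventions for illustration, not Google’s internal tooling.

```python
# Hypothetical provisioning CLI (illustration only, not the actual tool):
# omitting an optional flag silently selects a dangerous default.
import argparse

parser = argparse.ArgumentParser(description="Provision a cloud instance")
parser.add_argument("name", help="instance name")
# The trap: if --term-days is omitted, the instance quietly gets a fixed
# one-year term and is deleted automatically when that term expires.
parser.add_argument("--term-days", type=int, default=365,
                    help="instance term in days (default: 365)")
# A system-level fix would make the intent explicit, for example:
#   parser.add_argument("--term-days", type=int, required=True, ...)
# or default to no expiry and require opting in to automatic deletion.

args = parser.parse_args()
print(f"provisioning {args.name!r}: automatic deletion in {args.term_days} days")
```

In a design like this, the safer variants (a required flag, a no-expiry default, or a confirmation prompt before scheduling deletion) remove the trap from the system instead of relying on the operator’s memory, which is exactly the point the rest of this post argues.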
So, human error. Scold the engineer and case closed. Or is it?
Things are rarely so easy. ‘The outcome knowledge poisons the ability of after-accident observers to recreate the view of practitioners before the accident of those same factors,’ writes Richard Cook. This hindsight bias ‘makes it seem that events leading to the outcome should have appeared more salient to practitioners at the time than was actually the case.’1 Don Norman further observes that even when analysing our own actions, ‘we are apt to blame ourselves.’
But ‘suppose the fault really lies in the device, so that lots of people have the same problems. Because everyone perceives the fault to be their own, nobody wants to admit to having trouble. This creates a conspiracy of silence,’ resulting in the fault never being addressed.2
It’s easy to finish a post-accident analysis by pointing at a human error, especially when numerous safety procedures, devices and fallback mechanisms exist which were designed to prevent catastrophic outcomes. However, as James Reason warns us, while ‘this is an understandable reaction, it nonetheless blocks the discovery of effective countermeasures and contributes to further fallible decisions.’3
If what we are concerned with is the efficiency and safety of systems, assigning blame to individuals is counterproductive. Firing the ‘responsible’ party in particular may ironically cause more accidents rather than prevent them. Having experienced a failure, the operator ‘responsible’ for it is better equipped to handle a similar situation in the future. Meanwhile, a new person is prone to making the same error.
Fundamentally, people make mistakes, and if they fear being held accountable, they are less likely to admit to errors. This destroys the feedback loop which allows latent failures to be analysed and addressed. An organisation which repairs bad outcomes by disciplining ‘offending’ operators doesn’t end up with operators who make no mistakes; it ends up with operators who are great at hiding their mistakes.
This is why the practice of blameless postmortems (where incidents are analysed without assigning blame) is important. And it is why attributing the root cause to human error is feckless.
1 Richard Cook. 2018. How Complex Systems Fail (Revision G). Cognitive Technologies Laboratory, University of Chicago. Retrieved from researchgate.net/publication/228797158.
2 Donald A. Norman. 2013. The Design of Everyday Things: Revised and Expanded Edition. Basic Books, New York. ISBN 978-0-465-05065-9.
3 James Reason. 1990. The Contribution of Latent Human Failures to the Breakdown of Complex Systems. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences. doi:10.1098/rstb.1990.0090. PMID:1970893.