Learnings from Last Week's Global IT Outage
My day started last Friday with an early morning call from my daughter; she was on her way to work at a Boston hospital. A co-worker had texted her that their electronic patient care systems were down but did not know or explain why. Since I'm an IT guy (and her father), she called me, thinking a hack had caused the outage. I had already read a few reports on the cause and explained to her that the outage was due to a CrowdStrike software update that had gone awry. After further reading, I was surprised to learn of CrowdStrike's dominance in the market, with over half of Fortune 500 companies and U.S. government agencies using the company's software. Given that reach, it is not surprising that the problematic update had so much global impact.
Upon learning of the outage and knowing the scale of the companies involved (Microsoft and CrowdStrike), I immediately assumed they would have documented change management policies and procedures in place, which are part of a company's overall quality assurance (QA) program. A change management policy generally states that any new update is first installed on a test system, tested, and then certified ready for production. A company's process may also include an initial limited rollout to a selected subset of customers, which, in theory, should limit any adverse impact to that small group.
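To make the idea concrete, here is a minimal sketch, in Python, of what a staged (canary) rollout gate can look like. The stage names, fleet percentages, soak times, and health check are illustrative assumptions of mine, not a description of CrowdStrike's or Microsoft's actual process.

```python
# Minimal sketch of a staged (canary) rollout gate. Stage names, percentages,
# soak times, and the health check are illustrative assumptions, not any
# vendor's real process.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    fleet_percent: int  # share of endpoints that receive the update
    soak_hours: int     # observation window before promotion to the next ring

STAGES = [
    Stage("internal_test", 0, 24),        # vendor lab machines only
    Stage("canary_customers", 1, 24),     # small opted-in customer subset
    Stage("early_adopters", 10, 48),
    Stage("general_availability", 100, 0),
]

def roll_out(update_id: str, is_healthy: Callable[[str, Stage], bool]) -> bool:
    """Promote an update ring by ring; stop on the first failed health check."""
    for stage in STAGES:
        print(f"{update_id}: deploying to '{stage.name}' ({stage.fleet_percent}% of fleet)")
        if not is_healthy(update_id, stage):
            print(f"{update_id}: health check failed at '{stage.name}'; invoking backout plan")
            return False
        print(f"{update_id}: '{stage.name}' certified after {stage.soak_hours}h soak")
    return True

if __name__ == "__main__":
    # A check that fails at the canary ring halts the rollout before the
    # update ever reaches the broader customer base.
    roll_out("content-update-291", lambda uid, stage: stage.name != "canary_customers")
```

The point of the gate is simply that a bad update stops at the smallest possible blast radius.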
Best practices suggest that changes to production occur outside of regular business hours, but at a time when sufficient staff are available to address any issues. The change management policy should also require a documented backout plan in case the production update fails or otherwise causes problems; if problems do occur, the backout plan should quickly address the issue and restore services. A root cause analysis should then be conducted after production systems are restored. The team conducting the analysis will publish its findings of fact, identify gaps in policies or procedures, and propose recommendations to prevent a recurrence. It is then up to management to effect change. This outage will assuredly require changes to one or both companies' QA programs.
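Again as an illustration only, a documented backout plan can be as simple as recording the last known-good version before the change window so it can be restored quickly. The component and version names in this sketch are hypothetical.

```python
# Minimal sketch of a backout step: record the last known-good version before
# the change, and restore it if the update causes problems. The component and
# version names are hypothetical.
LAST_KNOWN_GOOD = {"sensor-content": "v7.10"}  # captured before the change window

def back_out(component: str, failed_version: str) -> str:
    """Revert a component to its pre-change version and return that version."""
    good = LAST_KNOWN_GOOD[component]
    print(f"Backing out {component} {failed_version}; restoring {good}")
    # In a real plan this step would redeploy the known-good version and
    # verify that services are restored before closing the incident.
    return good

back_out("sensor-content", "v7.11")
```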
I don't wish to cast aspersions on CrowdStrike's policies or procedures; we don't know exactly what the company did or did not do with regard to them. But I do know that IT systems face myriad vulnerabilities and threats, both internal and external. Inevitably, failures occur.
This incident should remind us of the importance of sound change management policies and procedures, and of the need for the entire organization to follow them consistently. CrowdStrike's final root cause analysis will determine the cause of the bad update; we can use it as a learning experience and an opportunity for software companies of all sizes to review their own QA programs.
My daughter had a hectic day at work, with her employer's electronic patient care system down and staff having to revert to their backup system: paper.
Joe Brown has over 3 decades of experience managing IT systems and security. He is an AWS Certified Cloud Practitioner, a Certified Information Security Manager (CISM), and a Certified Information Systems Security Professional (CISSP). Reach him at: jbrown@intraprise.com.