The recent global IT outage caused by a faulty CrowdStrike update has sent shockwaves throughout the business community, highlighting the vulnerabilities inherent in our increasingly interconnected digital landscape. As experts warn of potential exploitation by cybercriminals, it is crucial to assess the lessons learned and outline strategic measures to prevent such incidents from affecting customers in the future. This article delves into the essential steps needed to rebuild trust, enhance communication, fortify operational resilience, and address regulatory concerns to ensure long-term stability and security.
Key Lessons Learned
With experts now warning of further risks as criminals seek to exploit the IT issues, the lessons below focus on the strategic steps needed to ensure that this situation and its ‘harm’ do not impact customers in the future, such that trust is restored in the long term.
- Communication is key to protecting the company’s external and internal reputation and rebuilding trust with those impacted. After Friday’s crisis communication, the priority over the next couple of days should be strategic communication about the steps that will be taken, not only at the Company level but also at the Government and Regulator levels, to mitigate these risks in the future, such that trust is restored in the long term.
- The global ripple effect of the IT outage illustrates the interconnectivity of the complex supply chain and the concentration risk in this market. We need to look holistically at the supply-chain infrastructure that provides systems and services, investing in operational resilience across Data, Technology, Third Parties, People, and Processes. Material services should be identified and prioritised to ensure a proportionate response.
- This could be a wake-up call for employers to invest in tried-and-tested Disaster Recovery plans, which, unfortunately, in many cases have remained more of a paper-based exercise than plans tested at scale across key simulation scenarios, including extremely unlikely crisis scenarios like this one.
- Software vendors like CrowdStrike have become so large and interconnected that their failures can damage the global economic system and tens of millions of customers.
It is key that Companies, Governments, and Regulators as an ecosystem be more mindful of and perhaps concerned about the systemic or concentration risk of being dependent on a single major provider.
Whilst today it was CrowdStrike and Microsoft, on another occasion it could be one of the Cloud giants, Amazon, Microsoft, or Google, that goes down, and the impact would be just as detrimental, affecting tens of millions of customers.
From the government’s perspective, we need to start monitoring the impact of this situation in detail and tracking future events, even small ones. This would help build the nation’s ability and resilience to respond to similar events.
As for the Regulators, I would anticipate them being more joined up in protecting customers and building Digital Trust, and making a much bigger push to mitigate concentration risk, not only at the level of the Company but also at the level of the Providers available in the market to deliver the end-to-end service. I anticipate both tighter regulations, including the incoming DORA implementation, and tighter scrutiny from the Regulator should a Company prioritise cost and efficiency over the safety and security of its Customers.
On Trust
As we saw on the back of the IT outage, with Technologies come risks. And customers may not necessarily be fully aware of what they are exposed to. The global ripple effect of the IT Outage illustrates the interconnectivity across the supply chain and concentration risk in this market.
Building and restoring digital trust is key. Trust is a confident relationship with the unknown amid this chaotic uncertainty. As an enabler of strategic decisions, Technology could help millions of people across the world take a Trust leap. Further enhancement around Technology, Third-Party management, and Operational Resilience, combined with existing and incoming regulations (for example, DORA comes into play in 2025), can help ensure that future services and products are not only cost-efficient but also safe and secure.
I do feel that the real disruption happening isn’t technological. At its core is empowerment: empowering us as customers to navigate change and uncertainty in an agile and safe way. Communication is key here to maintain transparency and trust. I believe communication should focus not only on the impact the IT outage had on the organisation, including which material services (e.g. Payroll or Payments) were affected and the anticipated timelines to restore them, but also, more strategically, on the steps being taken by the Company to ensure that this situation and ‘harm’ do not impact employees and customers in the future, such that Trust is restored in the long term.
On Communication
A proportionate response to the IT outage would include clear, transparent, and timely communication, both externally with customers and internally with employees, about the impact the IT outage had on the organisation, which material services were affected, the key steps being taken today to restore those services, and clear timelines for when they will be restored.
But more importantly, what key strategic steps are being taken to ensure that this situation and ‘harm’ to customers do not recur, and that trust is restored?
I anticipate that after the crisis communication, the key will be strategic communication in the next couple of days about what happens next, not only at the Company level but also at the Government and Regulator levels, to mitigate these risks in the future.
On Regulation
The global ripple effect of the IT Outage illustrates the interconnectivity across the supply chain and concentration risk in this market.
Thinking about key strategic steps to ensure that this situation and ‘harm’ do not impact the customers in the future, restoring trust in the long term, I believe that going forward, the regulator will focus much more on operational resilience, holistically across Data, Technology, People, and Processes.
To ensure a proportionate response, material services (e.g. Payroll or making payments) would need to be prioritised. Whilst there is already a focus on DORA implementation for some industries by the 2025 deadline, there will be questions about whether it is enough to solve this challenge.
I would anticipate a bigger push from the Regulators to mitigate concentration risk, not only within the Company but also at the level of the Providers that are available in the market to provide material services.
I anticipate tighter regulations and tighter scrutiny from the Regulator should the Company choose to prioritise cost and efficiency over the safety and security of its operations and the potential ‘harm’ to customers.
Whilst I am not anticipating new regulations being developed on the back of this situation, I am anticipating stronger adherence to existing ones, including DORA and the FCA and EBA guidelines, with the regulator being more stringent about Companies’ compliance. Currently, the level of compliance varies, and I would anticipate that in conversations with the regulator, Companies will want to demonstrate at least full Level 3 compliance with the key risks and existing or proposed control frameworks, versus this being a checkbox exercise.

The issue we had was not purely a Technology release that went terribly wrong but, more importantly, the impact it had both internally on the Company and externally on the Company’s Customers, such as you and I. From the Regulator’s perspective, at the Company level, the key questions to answer would be:
- What did you do, once you realised the issue, to restore material services fast and rebuild Trust with your customers?
- Do you have a comprehensive view of your material services?
- Do you have a view of all the data, technology, people, and processes that would be key to restoring these quickly?
- Do you have tried and tested recovery plans, including crisis communication both internally and externally with customers?
- How fast can you access all of these and stand this up?
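The questions above assume a Company can already answer, for any failed component, which material services are hit and in what order to restore them. As a minimal illustration only, the sketch below shows a hypothetical material-service register that answers that question; the service names, dependencies, and priorities are invented for the example.

```python
# Hypothetical sketch: a minimal "material service" register.
# Service names, dependencies, and priorities are invented for illustration.
MATERIAL_SERVICES = {
    "payroll":   {"priority": 1, "depends_on": ["hr_db", "payments_gateway", "payroll_team"]},
    "payments":  {"priority": 1, "depends_on": ["core_banking", "payments_gateway"]},
    "reporting": {"priority": 3, "depends_on": ["data_warehouse"]},
}

def restoration_order(outage_components):
    """Return the material services hit by an outage, highest priority first."""
    hit = [
        (name, spec["priority"])
        for name, spec in MATERIAL_SERVICES.items()
        if any(dep in outage_components for dep in spec["depends_on"])
    ]
    return [name for name, _ in sorted(hit, key=lambda pair: pair[1])]

# A gateway failure pulls in every service that depends on it.
order = restoration_order({"payments_gateway"})
print(order)
```

Even a simple register like this turns “do you have a view of your material services?” from a paper answer into something that can be queried during an incident.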
Other Perspectives
- How can such a failure occur on a global scale?
The key challenge is Companies’ overreliance on one major provider for their systems, without mitigating the concentration risk by having several solutions in place, such that if one system is impacted, a workaround can be accessed fast. This is key not only on the Technology side, but also for People and Processes, e.g. how fast Companies can restore their material services so that they can continue serving their Customers.
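The workaround idea above can be sketched in a few lines. This is a hypothetical illustration, not any vendor’s API: the “providers” are stand-in functions, and an exception stands in for a provider outage.

```python
# Hypothetical sketch of mitigating single-provider concentration risk:
# route a material service through whichever provider is currently working.
# Provider functions and failure modes are invented for illustration.

def call_with_failover(request, providers):
    """Try each provider in order; return the first successful response."""
    errors = []
    for provider in providers:
        try:
            return provider(request)
        except RuntimeError as exc:  # stand-in for a provider outage
            errors.append(str(exc))
    raise RuntimeError(f"all providers failed: {errors}")

def primary(request):
    raise RuntimeError("primary provider outage")  # simulate today's scenario

def secondary(request):
    return f"processed {request} via secondary"

# The workaround is reached automatically, without manual intervention.
result = call_with_failover("payment-123", [primary, secondary])
print(result)
```

The point is less the code than the design choice: a second provider is only a mitigation if the switch-over path is already built and exercised before the outage happens.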
Another issue is that whilst regulators urge many Companies to have Disaster Recovery plans in place, tried and tested for a massive situation such as this, Companies in many cases treat these as a paper exercise and do not test them at scale. As a result, when an outage does happen, as it did today, Companies are not prepared to restore services fast enough. Finally, in many cases Companies have not focused on operational resilience, which would have required them to identify where the material impact occurs, which systems, data, people, and processes are affected, and to focus on restoring these step by step as a priority, rather than restoring everything at once.
- Why do so many systems depend on just a single vendor? Why can an entire OS be crippled by a software update? What does that tell us about current IT architecture?
In some cases, a single vendor is chosen on cost grounds. The rationale is that the vendor is so big and powerful (e.g. Microsoft) that Companies do not anticipate it could go down. The same applies to Cloud providers, e.g. Azure, where we see Banks building the full Bank on one Cloud provider on the assumption that the chance of an outage is extremely low. The challenge, of course, is that because multiple Companies rely on the same provider, once that provider goes down, millions of customers are impacted.
- Who should be held accountable? Will businesses trust CrowdStrike after this debacle? Should there be any repercussions?
Businesses will be held accountable by tens of millions of customers whose flights or services were badly disrupted today. This will lead to Customers seeking compensation, e.g. for time lost to cancelled and delayed flights, which would not be covered under a ‘bad weather’ event. The share price of CrowdStrike is very likely to drop by much more than 20%, perhaps by as much as 50%. There is likely to be a consortium built, similar to the one after the WannaCry cyber-attack in May 2017, to protect against these situations in the future.
- What is your opinion about the recovery procedures and fail-safe mechanisms the largest global businesses have in place?
Businesses would have Disaster Recovery plans which, in many cases, unfortunately, have remained more of a paper-based exercise than plans tried and tested at scale across key simulation scenarios. There are some fail-safe mechanisms, for example the concept of Safe Harbour used by Banks in the US after the WannaCry cyber-attack in May 2017, but they are not used at scale. I believe the key is not only fail-safe mechanisms for the Technology but for the end-to-end material service across People, Processes, and IT.
- What immediate changes are needed to prevent this from ever happening again? What will be learned to make systems reliable and redundant?
One key concern is the systemic or concentration risk of being dependent on one provider. While today it was CrowdStrike and Microsoft, the Cloud giants Amazon, Microsoft, or Google could go down on another occasion, and the impact would be just as detrimental, affecting tens of millions of customers.
Moreover, it is key for Companies to invest in Operational Resilience, which is broader than just Technology: it covers Technology, Data, Third Parties, Processes, and People. As a priority, material services need to be identified so that the response is proportionate. It is key to test Disaster Recovery plans, rather than keeping them as a paper exercise, and to ensure that all the people, processes, and data (and not only the technology) are tried and tested at scale, with sufficient preparation in place should such an outage happen in future. Simulation scenarios, and the testing of these, are key.
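One way to make such a drill more than a paper exercise is to check, for each simulated outage scenario, that every impacted dependency has a rehearsed recovery step. The sketch below is a hypothetical illustration; the scenario names, dependencies, and recovery steps are invented for the example.

```python
# Hypothetical sketch of a disaster-recovery drill: for each simulated outage
# scenario, flag any impacted dependency with no rehearsed recovery step.
# All scenario, dependency, and step names are invented for illustration.

RECOVERY_STEPS = {
    "endpoint_agents": "roll back to last known-good agent version",
    "payments_gateway": "switch traffic to the standby gateway",
}

SCENARIOS = {
    "bad_security_update": ["endpoint_agents"],
    "gateway_outage": ["payments_gateway"],
    "cloud_region_loss": ["payments_gateway", "data_warehouse"],
}

def run_drill(scenarios, steps):
    """Return, per scenario, the impacted dependencies lacking a recovery step."""
    return {
        name: [dep for dep in deps if dep not in steps]
        for name, deps in scenarios.items()
    }

gaps = run_drill(SCENARIOS, RECOVERY_STEPS)
print(gaps)  # the region-loss scenario exposes an unrehearsed dependency
```

Run regularly, even a toy check like this forces the gap list to shrink towards zero before a real outage finds it first.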
- What are the potential long-term impacts on the whole market? How might this affect cloud-based security and other solutions in the enterprise market?
I would particularly encourage Companies to look much deeper into their current reliance on Cloud providers and to mitigate the concentration risk, both within the Company and across the market, to minimise the impact on material services should a major IT disaster occur. The question is whether Companies can indeed rely on one or two providers for material services, and what fallback steps they would have in place.
- Should regulators do something about the situation in the market?
Regulators like the FCA and EBA have regulations that call out concentration risk in the market and the fact that Companies should be doing much more to mitigate it; other regulations globally have a similar effect. In practice, however, I have seen a lot of resistance from Companies to investing in several providers (e.g. cloud providers), given the cost and the fact that an outage is typically considered extremely unlikely based on historical probabilities. I believe that going forward there will be a much bigger push from Regulators to mitigate concentration risk, not only at the level of the Company but also at the level of the Providers available to provide the service. I anticipate both tighter regulations and tighter scrutiny from the Regulator should a Company prioritise cost and efficiency over the safety and security of its operations.
- Will this outage be costly to taxpayers? Why?
I think it will be costly for everyone, whether the money is spent directly on addressing this challenge or on thinking more strategically about the future. I would anticipate taxpayer money funding not only the direct impact of today’s IT outage but, more importantly, the longer-term remediation, including more robust policies and processes from the Regulator and from Businesses to keep us as customers safe and secure.
Alina Timofeeva is a multi-award-winning strategic advisor in Data & Technology. She is a Board member for BCS, The Chartered Institute for IT, a thought leader, and a sought-after speaker at large industry events. She has worked with the leading consulting firms Oliver Wyman, KPMG, and Accenture, where she served major banks as an expert in cyber, cloud, data, operational resilience, and regulation, building digital trust and ensuring that tens of millions of customers are safe and secure. Here she shares key lessons learned and the strategic steps to be taken to ensure that this situation and its ‘harm’ do not impact customers in the future, such that trust is restored in the long term.