CrowdStrike & the Greatest Tech Outage in History

While everyone and their grandma has been talking about CrowdStrike, it’s important to understand what it is and why it matters. The company is at the center of the biggest computer outage the world has ever seen, one that affected everything from critical government infrastructure to hospitals, banks, and media houses all over the globe. All of that was caused by a few lines in a single update.

That’s why in this article we’ll discuss the details and origins of the CrowdStrike outage that occurred this Friday, who’s to blame, who’s affected, why we need to rethink the current security model, especially in critical computer applications, and what’s going to be the aftermath of all of this for the affected and the industry.

What is CrowdStrike?

CrowdStrike is an American cybersecurity company and one of the global leaders in the field. Its flagship product, CrowdStrike Falcon, is a cloud-native suite of security solutions delivered as SaaS (Software as a Service) on a monthly subscription model, mostly to large companies.

Falcon EDR (Endpoint Detection and Response) is a single component of that Falcon Platform. It is one of the most reputed and popular EDRs in the industry, has enjoyed industry support for years, and is widely used in critical institutions like banks, federal agencies, and giant corporations all over the globe, from the US Department of Defense to a large number of Fortune 500 companies. CrowdStrike rose to fame on its EDR capabilities, especially its approach of delivering security through the cloud in real time, which was not popular when the company started in 2011.

What is Falcon EDR – Product by CrowdStrike?

Before discussing why companies use CrowdStrike, let’s discuss what Falcon EDR is.
CrowdStrike Falcon EDR has two components:

Falcon Sensor

A lightweight program that reports to the CrowdStrike servers and performs real-time detection and response to threats on the endpoint. The Falcon Sensor is just a few megabytes and has no additional requirements besides basic network connectivity. It doesn’t need to scan all files and add resource overhead like traditional antiviruses. Once installed, this agent runs in Ring Zero, the most privileged level on a computer, with direct access to the hardware. On an operating system like Microsoft Windows, that means the Falcon Sensor runs alongside the OS kernel and can monitor everything on the machine without the restrictions that apply to regular applications. This has large benefits but a few downsides too.

To install Falcon EDR, an engineer does a one-click install of the Falcon Agent on the endpoint that needs protection, and that’s it. No restart, no additional server to host or configure, nothing!

Falcon Panel

Once the Falcon Agent is installed on a machine, the machine appears in the Falcon Panel, which CrowdStrike hosts on its own cloud servers. The analyst doesn’t need to worry about hosting or configuration, since the sensor is tied to the CrowdStrike account and the specific license that one has.

All threat-related information, along with response capabilities and a plethora of configuration options, is present in the panel. An analyst can manage thousands of endpoints in an environment through a few clicks in the Falcon Panel and set up integrations with many other supporting technologies and security solutions. The Falcon Panel offers a powerful graphical interface for the analyst, along with an even more powerful API.

Why do companies use the CrowdStrike platform?

Unified Cloud Native Platform

Traditionally, security teams in large companies have used many tools at the same time to protect their endpoints (just a fancy term for desktop computers, servers, laptops, or phones in an enterprise), such as IDS (Intrusion Detection Systems), IPS (Intrusion Prevention Systems), antiviruses, firewalls, SIEM, SOAR, and UTM solutions. This has led to fatigue among security teams in larger companies.

Most of these tools need specific agents (small pieces of software that report back to a server) on the individual machines to be protected, plus separate servers to host the tools and to aggregate and process the data sent by the agents. Implementing multiple solutions, configuring every single one of them, and constantly updating and maintaining both agents and servers costs a lot in both money and human hours, let alone troubleshooting them. That’s where CrowdStrike comes into play with its cloud-native EDR.

Advanced Detection

Falcon EDR isn’t just an antivirus; malware protection is only a single component of the EDR. It differs fundamentally in its approach to identifying and dealing with malware (viruses are a form of malware). Normally, antiviruses have powerful agents that maintain a definition list of viruses and regularly scan files on a computer to see whether any file matches a definition in that list.

This approach has been an industry-wide approach for all antivirus solutions but it adds additional resource overhead and is not fast enough to match sophisticated threats. Most antiviruses work at Ring 3 also known as the userspace privilege level where most of the regular programs in an operating system work.

Where Falcon EDR differs is that instead of scanning all files in a system, it hooks into the system’s kernel (the most privileged core of the operating system) and monitors all network and process execution activity happening inside the computer. It does this using machine learning and real-time data analysis baked into the Falcon Sensor, with algorithms trained on CrowdStrike’s petabyte-scale enriched Global Threat Intelligence database, which is constantly updated on the fly as new threats are reported around the world. That’s how the Falcon Sensor tracks behavior and responds to threats.

Swift Response

Using CrowdStrike’s advanced response tactics, trained on some of the same petabyte-scale datasets, the EDR agent stops any oddly behaving activity from executing as soon as it senses it, instead of waiting for a signature file or scanning the system multiple times a day. Instead of signature-based definition files, Falcon EDR maintains a behavior-based local list of the latest threats that is updated in real time.
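To make the difference between the two approaches concrete, here is a toy sketch in Python. It is not how Falcon works internally; the hash list, the `ProcessEvent` fields, and the single rule are all hypothetical and exist only to contrast matching a known file with matching suspicious behavior.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical signature list: SHA-256 hashes of files already known to be bad.
KNOWN_BAD_HASHES = {hashlib.sha256(b"known malware sample").hexdigest()}


def signature_match(file_bytes: bytes) -> bool:
    """Traditional approach: flag a file only if its hash is already on the list."""
    return hashlib.sha256(file_bytes).hexdigest() in KNOWN_BAD_HASHES


@dataclass
class ProcessEvent:
    parent: str        # process that launched the child
    child: str         # process that was launched
    command_line: str  # full command line of the child


def behavior_match(event: ProcessEvent) -> bool:
    """Behavioral approach: flag suspicious activity regardless of any file hash.

    Toy rule: an office application spawning PowerShell with an encoded command.
    """
    office_parents = {"winword.exe", "excel.exe"}
    return (
        event.parent.lower() in office_parents
        and event.child.lower() == "powershell.exe"
        and "-enc" in event.command_line.lower()
    )


# A slightly modified sample has a new hash, so the signature list misses it.
new_variant = b"known malware sample, slightly modified"
event = ProcessEvent("WINWORD.EXE", "powershell.exe", "powershell.exe -enc SQBFAFgA")

print(signature_match(new_variant))  # False
print(behavior_match(event))         # True
```

The signature check fails the moment the file changes by a single byte, while the behavioral rule still fires because it looks at what the program does rather than what it is.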

As an automated response, it can kill a process or stop a network packet as it starts. Aside from the response capabilities, it can present a detailed graph view of the threat timeline, its activities, triggered activities, and a comprehensive graphical family tree of the processes and network activities performed by a threat. The updates are synced so tightly that if a machine with Falcon EDR installed doesn’t respond to CrowdStrike servers within 60 seconds, it is marked as unsafe in the panel. Those sync messages between the machine and CrowdStrike servers are known as heartbeats.
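The heartbeat idea can be sketched in a few lines of Python. This is only an illustration of the timeout logic, assuming the 60-second threshold mentioned above; the host names and the `stale_hosts` helper are made up for the example and are not part of any CrowdStrike API.

```python
from datetime import datetime, timedelta

HEARTBEAT_TIMEOUT = timedelta(seconds=60)  # threshold taken from the text above


def stale_hosts(last_seen: dict[str, datetime], now: datetime) -> list[str]:
    """Return the hosts whose last heartbeat is older than the timeout."""
    return sorted(
        host for host, seen in last_seen.items()
        if now - seen > HEARTBEAT_TIMEOUT
    )


now = datetime(2024, 7, 19, 12, 0, 0)
last_seen = {
    "web-01": now - timedelta(seconds=12),  # checked in recently
    "db-01": now - timedelta(seconds=95),   # silent for too long
}

print(stale_hosts(last_seen, now))  # ['db-01']
```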

More than an EDR

CrowdStrike calls its Falcon Platform an XDR (Extended Detection & Response), a term for platforms that combine the endpoint protection capabilities of an EDR with network protection. So far we’ve discussed Falcon EDR, which is just one component of the CrowdStrike Falcon Platform. The platform has many more components covering Network Protection, Hardware Security, Threat Intelligence, and a lot more, all delivered over the same sensor and the same panel we discussed earlier.

Integrating all those modules is just a click away, since they use the same panel and sensor as Falcon EDR. Unlike traditional EDRs, the CrowdStrike Falcon Platform offers a lot more, with unparalleled convenience for security teams, which it monetizes through expensive monthly subscriptions.

An overview of the CrowdStrike Falcon Platform offerings

What are Cloud providers?

Cloud refers to a large collection of high-performance computers, deployed at scale and connected to the internet, that are rented out or used to provide a service to end users. These collections are called data centers, and the companies that own them and provide such services are known as cloud providers.

Cloud providers manage and configure those data centers to provide services hosted on them or the entire computer clusters (the individual computers) to be rented out or used by their clients. Some of the largest players in the cloud are AWS (Amazon Web Services), Microsoft Azure, GCP (Google Cloud Platform).

These companies provide an extensive range of cloud services, be they hosted services managed by them or bare-metal instances in their data centers. For simplicity, we’ll discuss the two most widely adopted cloud service approaches:

IaaS (Infrastructure as a Service) – This approach refers to a cloud provider offering raw compute resources (virtual machines or bare-metal servers) over the internet. Amazon’s EC2 is an example of IaaS.

SaaS (Software as a Service) – This is the more common approach, where software companies, instead of handing out their software, host it on a cloud server and provide access to it through an interface like a web browser. Netflix and CrowdStrike are examples of SaaS.

Insights into the CrowdStrike outage

At roughly midnight on Friday, 19th July 2024, CrowdStrike released a routine silent update to all of its clients using Falcon EDR on Microsoft Windows. Such updates are pretty normal, but this time, as the CrowdStrike sensor on Windows hosts got updated, it caused a kernel error that rendered swathes of Windows endpoints completely unbootable. The issue was specific to Microsoft Windows, since other operating systems running CrowdStrike Falcon were unaffected, but it was caused by CrowdStrike’s faulty update.

What happened?

Before discussing how, it’s important to get the hang of some of the terminology:

Kernelspace/Ring Zero – A privilege level with unrestricted access to a computer’s physical resources. The operating system kernel runs at this level, and any unrecoverable error here brings up the Blue Screen of Death in Windows, a protection feature that shuts down the system to protect data and applications from further damage.

Userspace/Ring 3 – The least privileged level in a computer. It is controlled by the operating system, so an application crash here can be handled by the operating system instead of shutting down the entire computer.
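The containment that userspace provides can be felt with a small Python experiment. It is only an analogy, assuming a Python exception stands in for an application crash: the child process dies, the operating system cleans it up, and the parent carries on. A fault in kernel space has no such parent to fall back on.

```python
import subprocess
import sys

# The child deliberately fails; it stands in for any user-space application crash.
child_code = "raise RuntimeError('simulated application crash')"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
)

# The child exited with a non-zero code, but this parent process keeps running.
print("child exit code:", result.returncode)
print("parent is still running")
```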

The entire outage was caused by a faulty Rapid Response Content update that CrowdStrike pushed to its Windows users that Friday, which triggered a kernel error. It should be noted that Rapid Response Content is not code but configuration data for the sensor, and that it is updated automatically without user control, unlike Sensor updates, which follow a release cycle that the user can control.

The Rapid Response Content update file itself was just 40 kilobytes, and such updates are pretty common, sometimes arriving multiple times a day. Bugs are nothing new, and normally CrowdStrike fixes them swiftly, but this specific bug was different.

Falcon EDR gets its powerful monitoring capabilities by running in kernel mode, the same highly privileged level as the core of the operating system, with the highest level of access to the system. Any unhandled error at this level causes the infamous Blue Screen of Death (BSOD), a protection feature in Microsoft Windows that shuts down the system when such a severe error occurs. That’s exactly what happened to millions of Windows computers around the world, which would not even boot into Windows.

The fix

CrowdStrike claims it took them 78 minutes to detect the problem and push a corrected content file, but every Windows computer that had updated by then was already affected. Usually, when small bugs are released, CrowdStrike issues prompt patches that are applied automatically without any human interaction. This issue was different because the error left the operating system unable to boot, and the only way to fix it was to go to every affected endpoint that had CrowdStrike installed, manually boot it into Windows Safe Mode, and delete the corrupted files that CrowdStrike had released as an update.

After the initial release of the bad update, CrowdStrike promptly withdrew it so that it would not affect any more users and released a workaround fix.

The initial fix by CrowdStrike required companies to manually boot each of their Windows endpoints into Safe Mode, remove the corrupted update files, and then restart. Now imagine large corporations with thousands of endpoints on a single campus. That’s a nightmare for IT professionals. Microsoft has released a program that automates some of the work, but it still requires manual effort on each machine.
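The file-removal step of that workaround could be scripted roughly as below. Treat this as a sketch rather than an official tool: the directory and the `C-00000291*.sys` pattern come from CrowdStrike’s publicly posted remediation guidance, not from this article, and should be verified against the vendor’s current instructions. The helper name and the demo file names are invented for the example, and the machine still has to be booted into Safe Mode first.

```python
import tempfile
from pathlib import Path

# Assumed defaults, based on CrowdStrike's published workaround.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(directory: Path = DEFAULT_DIR, dry_run: bool = True) -> list[Path]:
    """Find (and, unless dry_run is set, delete) files matching the faulty pattern."""
    matches = sorted(directory.glob(CHANNEL_FILE_PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


# Demo against a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "C-00000291-00000000-00000032.sys").write_bytes(b"bad")
    (demo_dir / "C-00000290-00000000-00000001.sys").write_bytes(b"ok")

    removed = remove_faulty_channel_files(demo_dir, dry_run=False)
    remaining = sorted(p.name for p in demo_dir.iterdir())

print([p.name for p in removed])  # ['C-00000291-00000000-00000032.sys']
print(remaining)                  # ['C-00000290-00000000-00000001.sys']
```

The `dry_run` default is deliberate: on a real machine you would want to see what matches before deleting anything from a driver directory.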

How did it happen?

We don’t have much detailed information right now but CrowdStrike recently did come up with an explanation in their initial review of the situation detailing what had caused the issue.

The company claimed that Falcon Sensor updates, specifically Rapid Response Content, go through specialized checks including resource utilization, system performance impact, event volume, and adverse system interactions through their Content Validator program.

The company claims the latest update files were run through the Content Validator but passed its checks despite being faulty, due to a bug in the Validator itself (the irony). Subsequent checks relied on this initial assessment, and that’s how the error slipped through.
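Why is relying on a single upstream check risky? The toy Python sketch below shows one way to think about it. Nothing here reflects CrowdStrike’s actual validator or content format: the field count, the record layout, and both functions are hypothetical. The point is only that the component consuming the data can re-check it instead of trusting an earlier pass.

```python
EXPECTED_FIELDS = 3  # hypothetical number of fields the interpreter reads


def buggy_validator(record: list[str]) -> bool:
    """Stand-in for a validator with a gap: it only checks that the record is non-empty."""
    return len(record) > 0


def interpret(record: list[str]) -> str:
    """Defensive interpreter: re-checks the field count instead of trusting the validator."""
    if len(record) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(record)}")
    return record[EXPECTED_FIELDS - 1]


bad_record = ["rule-id", "match-pattern"]  # one field short

print(buggy_validator(bad_record))  # True: the faulty check lets it through
try:
    interpret(bad_record)
except ValueError as err:
    print("rejected:", err)  # rejected: expected 3 fields, got 2
```

In user-space Python a bad read raises an exception that can be caught. In a kernel driver, reading past the end of the data has no such safety net, which is why the second check matters.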

New measures

In the Preliminary Information Report, CrowdStrike promises to implement several additional measures to prevent similar incidents in the future. Specifically, the firm listed the following additional steps when testing Rapid Response Content:

Local developer testing
Content update and rollback testing
Stress testing, fuzzing, and fault injection
Stability testing
Content interface testing

The Rapid Response Content deployment is set to have the following changes from now on:

Implement a phased roll-out of update deployment
Improve monitoring of sensor and system performance during deployments, using feedback to guide a phased rollout.
Provide customers with more control over the delivery of Rapid Response Content updates, allowing them to choose when and where updates are deployed.
Conduct multiple independent third-party security code and deployment process reviews
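A phased roll-out, the first item on that list, can be sketched in Python as follows. This is a simplified illustration and not CrowdStrike’s deployment system: the stage sizes, the `deploy` and `is_healthy` callbacks, and the host names are all assumptions made for the example.

```python
from typing import Callable


def phased_rollout(
    hosts: list[str],
    stage_fractions: list[float],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy in growing stages and stop as soon as a stage contains an unhealthy host.

    Returns the hosts that received the update.
    """
    updated: list[str] = []
    for fraction in stage_fractions:
        target = round(len(hosts) * fraction)
        batch = hosts[len(updated):target]
        for host in batch:
            deploy(host)
            updated.append(host)
        if not all(is_healthy(h) for h in batch):
            break  # halt the rollout; the remaining hosts never get the update
    return updated


hosts = [f"host-{i:03d}" for i in range(100)]
crashed: set[str] = set()


def faulty_deploy(host: str) -> None:
    crashed.add(host)  # simulate a bad update: every updated host crashes


def is_healthy(host: str) -> bool:
    return host not in crashed


updated = phased_rollout(hosts, [0.01, 0.10, 1.0], faulty_deploy, is_healthy)
print(len(updated))  # 1
```

With stages of 1%, 10%, and 100%, the simulated bad update reaches a single host out of a hundred before the health check stops it, instead of reaching all of them at once.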

CrowdStrike also promised to add additional validation checks to the Content Validator that passed this faulty update as valid and has also committed to releasing a Full Root Cause Analysis with more details once the internal investigation concludes.

The Impact & Aftermath

The outage has caused severe disruption over a large sector of daily life, businesses, and governments including Banks, Airports, Hospitals, Emergency Services, Financial firms, and Broadcasting companies, all over the globe. Rough estimates suggest that approximately 8.5 million Windows computers worldwide were affected.

CrowdStrike claimed before the outage to have 24,000 customers with nearly 60% of the Fortune 500 companies and more than half of the Fortune 1000.

The financial impact of this outage is expected to be huge. Early reports put the damage in the billions of dollars, with some estimates of US$5B-$10B. The healthcare and banking sectors were hit hardest with a combined $4B of losses, followed by the airline industry with about $800M in direct losses. Only 10-20% of those losses are estimated to be covered by insurance. Delta Air Lines alone has sued CrowdStrike, citing losses of over $500M. Whether this outage falls under the EU’s GDPR also remains to be seen.

Reflecting on the situation, CrowdStrike stated on July 31st, that “~99% of Windows sensors are online as of July 29 at 5 pm PT, compared to before the content update.”

Are lives at risk?

Potentially lives were at risk since critical Emergency Services like 911 were seriously impacted in some states and Hospitals were also affected causing delays in providing healthcare.

Transport was also affected, which could indirectly affect human lives, but we currently don’t have any evidence of lives being directly affected by this outage.

What will happen to CrowdStrike?

It is hard to predict what will happen to CrowdStrike. What is certain is that this incident is a severe setback for the company, and it will have to make a serious effort to regain the trust it has lost, which will be a difficult task. Lawsuits and insurance will be a cost to pay. CrowdStrike’s stock has already shown a historic dip, dropping from an all-time high.

Will the law take action?

Litigation has just begun, with airlines like Delta Air Lines filing lawsuits and shareholders filing class-action suits for damages caused by the CrowdStrike outage. But the extent of liability for those damages is subject to CrowdStrike’s terms and conditions.

Regulators like the EU could also step in to check whether the company breached regulatory standards. It is unclear right now whether the European Union will take action, or whether a loss of availability of customer data through a technical bug like this one counts as an offense under GDPR. We can expect to hear more from the EU.

Will this give a boost to competition?

In the short term, CrowdStrike’s competitors will likely benefit from this outage. The outage has shown how complete reliance on a single vendor can turn out. Companies will likely want to diversify their security apparatus, and that would mean increased market share for CrowdStrike’s competitors. It should be noted, though, that this situation is not unique to CrowdStrike. This has happened to other companies before, although this time the scale has been unprecedented.

Similar Large-scale Tech Outages

Amazon Web Services Outage (2017)

In February 2017, a typo made by an engineer during routine debugging of the S3 billing system took AWS S3 (Simple Storage Service) offline for about four hours.

About 150,000 domains used Amazon’s S3 service. Quora, Giphy, Instagram, IMDb, American Airlines, Imgur, and Slack were among those affected. Some people weren’t even able to control the lights in their homes during the outage. The servers came back online more than four hours later, with the outage costing about $160M.

Facebook, Instagram, and WhatsApp Outage (2021)

In October 2021, Meta (then Facebook) faced an outage across its services, including Facebook, WhatsApp, Instagram, Messenger, Oculus, and Mapillary. The outage completely locked people out of their Facebook accounts, and users of outside services were not able to use the ‘Log in with Facebook’ feature. The outage lasted 7 hours, costing Facebook $60M in damages, while its CEO Mark Zuckerberg’s wealth fell by more than $6B.


Single Point of Failure

The recent war has given us a glimpse of how it could feel to overly rely on other nations’ tech and why every nation must build some form of backup plan, especially for infrastructure that is critical to the people at a large scale. Imagine being a Russian the day various tech giants started issuing sanctions on the country. Giants like AWS, GCP, and Azure who are hosting hundreds of thousands of companies can just revoke access to their servers with a single click.

In the security space, Russia has its own EDR made by a Russian company, Kaspersky Lab, but imagine for a moment that its state functionaries, banks, and critical infrastructure were being handled by the American CrowdStrike, and CrowdStrike wished to pull the plug.

Impact of Tech on the Mango Man (Aam Aadmi)

For an average Russian, it could have gone like this. They wake up and can’t access their grocery application, which runs on AWS. They can’t use tap to pay at the physical grocery store on the next block, since most of the payment processors have been blocked. They rush to the nearest ATM, only to find out that the security service the bank has been using has caused its systems to crash. In a state of frenzy, they walk home, only to find that their TV broadcasters are nowhere to be found. The ICS (Industrial Control Systems) that control power, water, and gas access have been down for some time.

Now take Russia out of this picture and imagine it’s Pakistan, and what impact this would have on the public at large, who wouldn’t even know what was happening to them. That’s one of the costs of over-reliance on a select few entities that you don’t control.

The road ahead for CrowdStrike

The CrowdStrike outage of 2024 will go down in history as one of the largest tech outages, if not the largest, with an unprecedented impact on critical industries like banking, healthcare, and transport around the globe. It’s also a wake-up call for companies and institutions to set up robust backup mechanisms and put serious checks and quality validation in place, especially where security is the priority. The outage has shown what a single point of failure can do when it goes wrong.

But after all this impact, there is still a bright side to this incident. In the long run, it will pressure companies like CrowdStrike to build stronger systems to make sure technical bugs like this never halt half the world again. It will also push companies like Microsoft to look for better ways to secure the large ecosystems they’ve established. On that front, Microsoft has restarted the kernel protection debate that had been killed off in the late 2000s. That will have ripple effects far beyond the cybersecurity industry.

Further learning and references

Falcon Content Update for Windows Hosts
Inside the 78 minutes that took down millions of Windows machines
CrowdStrike blames test software for taking down 8.5 million Windows machines
How Microsoft helped clean up CrowdStrike’s mess
Huge Microsoft Outage Caused by CrowdStrike Takes Down Computers Around the World
Microsoft calls for Windows changes and resilience after CrowdStrike outage
CrowdStrike: ‘Content Validator’ bug let faulty update pass checks
CrowdStrike to Cost Fortune 500 $5.4 billion
We finally know what caused the global tech outage – and how much it cost
Windows Security best practices for integrating and managing security tools
Amazon S3 Outage Has Broken A Large Chunk Of The Internet

Saqib Tahir