On July 19, 2024, a buggy software update from CrowdStrike took down 8.5 million Microsoft Windows devices worldwide, leaving them stuck on the “blue screen of death.” Flights were canceled, financial transactions froze, and tech stocks tumbled.

“What is CrowdStrike?” you might have asked yourself while trying to make sense of your phone’s alerts and the nightly news. It’s a U.S. cybersecurity company with a major footprint in the tech market. Falcon, the software that crashed, is a product installed on computers to detect and isolate cyberattacks and malware. We enjoyed Toby Murray’s explanation of what happened here.

To do that job, Falcon runs as “privileged software.” It’s the groundskeeper with all the keys to your house: it can access anything on your property if you’re out of town and something goes wrong.

Sensationalist headlines abounded, and big questions were raised: Is the cloud not secure? What does this mean for how organizations tackle cybersecurity?

Whether or not you hold a technical role within your organization, we have some good news for you: the CrowdStrike failure isn’t really a technical problem. It’s about business processes, change management, and how your organization handles risk.

Here are some of the key takeaways to think about as corporations, airlines, and banks continue to fix affected computers. (We’ve added memes for your entertainment…)

1. Practice makes perfect

Who you gonna call? (Don’t say “Ghostbusters”…)

When parents leave their kid home with a babysitter, they don’t say, “Ok, so if there’s an emergency, we want you to start Googling our names and trying to find our LinkedIn accounts. Maybe you’ll find a phone number.” They leave a list of key phone numbers: how to reach them, the doctor’s info, and other emergency contacts.

Recovering from a bad software deployment is something your team should practice when things are fine, or mostly fine. That way, you know you’re ready if something ever goes seriously wrong.

Next time you deploy something to a test environment prior to shipping, have your team evaluate how you’ll respond if something goes wrong. Here are a few scenarios to work through:

- Who do you call if you need to restore something from backup? Where are your backups stored, and who manages them?

- Who has the keys to your Domain Name System (DNS) configuration? (Basically, the system that translates the name you type into the address where your app is hosted; see the one-line example after this list.) How long would it take to switch data centres if there was some sort of outage?

- Who handles recovery once a fix is in hand? Make sure that you’ve assigned accountabilities and practiced recovery with the right people.
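To demystify the DNS bullet above: the name-to-address translation is a one-line lookup in most languages. Here’s a minimal Python sketch using only the standard library (example.com is just a stand-in domain):

```python
import socket

# Translate a human-readable name into the IP address where it's hosted.
# "example.com" is a stand-in; try one of your own domains.
print(socket.gethostbyname("example.com"))  # prints an IP address, e.g. 93.184.x.x
```

Your DNS configuration controls the answer to that lookup, which is why knowing who holds its keys matters.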

(If you want to do “hard mode,” ask your developers for a Chaos Monkey: a software tool developed at Netflix that creates random spot failures to test a system’s resilience. Run it in pre-production so you can practice before failures ever affect your live users.)
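To make “hard mode” concrete, here’s a minimal chaos-monkey-style sketch in Python. It isn’t Netflix’s actual tool: the service names and failure rate are invented for illustration, and a real experiment would terminate instances or block dependencies rather than print a message.

```python
import random

# Hypothetical service names -- stand-ins for whatever runs in your
# pre-production environment.
SERVICES = ["auth", "payments", "search", "notifications"]
FAILURE_RATE = 0.25  # chance that any given run injects a failure


def unleash_monkey(services: list[str], failure_rate: float) -> str | None:
    """Randomly pick one service to 'fail', or do nothing.

    A real chaos experiment would actually kill the process or instance;
    here we just name the victim so the team can run its recovery drill.
    """
    if random.random() > failure_rate:
        return None  # quiet run: no failure injected
    victim = random.choice(services)
    print(f"Chaos monkey took down '{victim}' -- start your recovery drill!")
    return victim


if __name__ == "__main__":
    unleash_monkey(SERVICES, FAILURE_RATE)
```

The value isn’t in the ten lines of code; it’s in forcing your team to rehearse the phone calls, runbooks, and accountabilities from the list above.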

2. Isolate and limit damage (via deployment rings)

If you ask any product owner or software developer, they will probably tell you about a time they hit “deploy” on a new feature or a bug fix and held their breath.

Ensuring you have a good deployment process is how you, first, limit damage and, second, isolate it.

Deployment rings have been part of software best practices for decades. Consider how they might apply to your team (there’s a small code sketch after this list):

- First, privately deploy features to canaries: a small group of early volunteers who test a feature first.

- Second, privately review the entire release with early adopters: a full preview where you can systematically review and test what you’re about to ship.

- Finally, deploy publicly to users.
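Here’s a minimal sketch of rings as a feature gate, in Python. The ring names, features, and rollout states are invented for illustration; real systems usually do this with a feature-flag service, but the logic is this simple at heart.

```python
# Rings in rollout order: a feature ships to one ring at a time.
RINGS = ["canary", "early_adopters", "public"]

# Hypothetical rollout state: how far each feature has progressed.
ROLLOUT = {
    "new_dashboard": "early_adopters",  # canaries + early adopters see it
    "dark_mode": "canary",              # only canaries see it so far
}


def is_enabled(feature: str, user_ring: str) -> bool:
    """A user sees a feature if their ring is at or before the ring
    the feature has currently rolled out to."""
    current = ROLLOUT.get(feature)
    if current is None:
        return False  # unknown feature: off by default
    return RINGS.index(user_ring) <= RINGS.index(current)


print(is_enabled("new_dashboard", "canary"))      # True
print(is_enabled("new_dashboard", "public"))      # False -- not yet
print(is_enabled("dark_mode", "early_adopters"))  # False
```

If a bad change ships, only the innermost ring feels it, and you can roll back before the public ever sees a blue screen.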

We’ll never be able to completely prevent software bugs; they’re a normal part of releasing updates and new features. But you can limit their scope and increase your chance of isolating damage. The best thing about this methodology? A release process that limits damage can be set up as simply as listing your test groups on a legal pad or tracking them in a spreadsheet. It’s low-tech mitigation with high-tech benefits.

3. You can’t buy compliance, but you can enforce it

CrowdStrike is just one of many tools in use across organizations’ IT assets. We use tools to scan email attachments for viruses, verify login credentials, and, yes, detect cyberattacks.

We buy these tools as a method of transferring risk to a third party. It doesn’t make sense for most government organizations to write their own malware-detection software. There are private-sector organizations that are world leaders in this area, after all, so we invest in their SaaS solutions to uphold organizational standards for the security, confidentiality, and integrity of our systems.

If you’re asked to check out tools or procure a solution, think about:

- Is it acceptable to transfer the risk? 

- How can you procure tools that follow the principle of “least privilege”? That’s the security concept that privileged users (or processes acting on their behalf) should have access only to the minimum data necessary to accomplish assigned tasks. Think of it this way: if you run a convenience store with 10 employees, the only ones who need access to the safe and petty cash are the ones who lock up the shop at night. (There’s a small sketch of this after the list.)

- Can this system be “sandboxed” in some way that would limit a cascade failure if something went wrong? To continue with our analogies, this is the equivalent of installing fire doors between rooms so a blaze doesn’t engulf an entire building.
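As a sketch of least privilege in code, here’s a toy role-to-permission mapping in Python that echoes the convenience-store analogy. The roles and permissions are invented; the point is the deny-by-default allowlist.

```python
# Each role gets only the permissions it needs -- nothing more.
ROLE_PERMISSIONS = {
    "clerk": {"operate_register"},
    "closer": {"operate_register", "open_safe", "handle_petty_cash"},
    "manager": {"operate_register", "open_safe", "handle_petty_cash",
                "change_schedules"},
}


def can(role: str, permission: str) -> bool:
    """Deny by default: a permission must be explicitly granted."""
    return permission in ROLE_PERMISSIONS.get(role, set())


print(can("clerk", "open_safe"))   # False -- clerks don't need the safe
print(can("closer", "open_safe"))  # True  -- closers lock up at night
```

The same pattern applies when you evaluate a vendor: ask what the tool’s equivalent of “open_safe” is, and whether it truly needs it.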

At Button, we do see some of the world’s best engineers working on solutions to help prevent an outage of CrowdStrike’s magnitude, through something called eBPF, a technology for safely sandboxing code that runs in the operating system’s kernel, which could limit catastrophic failures when updating kernel-level software. (The link is kind of technical, but fascinating! If you find it exciting but have questions, we’d be happy to chat.)

Our big takeaway from July 2024’s massive outages is that this isn’t really just a CrowdStrike problem. You can buy software, but you can’t necessarily buy compliance or prevent bugs. 

But you can prepare, limit the scope and impact of the damage, and understand your risks. 

So let’s talk about it. We’re happy to help!

