In the early hours of July 19th, 2024, CrowdStrike endpoints on Windows machines worldwide received a faulty content update, causing what is shaping up to be one of the largest global IT outages to date.
Reports poured in from all over the world of Windows workstations and servers stuck in a boot loop with a BSOD, impacting airlines, airports, banks, hospitals and other critical infrastructure such as emergency services call centers.
A detailed RCA from CrowdStrike will surely follow and shed more light on why the update was pushed to the entire install base and how it passed testing. Until then, we have nothing but our best wishes for our colleagues at CrowdStrike managing this incident.
Nonetheless, this is a good opportunity to discuss and highlight Cato’s Gradual Deployment Model, which is at the very core of how we manage our cloud service and the managed endpoints using the Cato Client.
Graduality, and more graduality
At Cato, no guideline across the entire Engineering and Operations organization is stricter than graduality. It is also, without a doubt, the most consistently followed guideline, whether in coding practices, production changes or the publishing of new software updates.
In simple terms, nothing is EVER executed on everything all at once. That ‘everything’ can be servers in our Cato SASE Cloud service (e.g. cloud PoPs, backend management services, Kubernetes clusters, etc.), managed Socket devices or Cato Clients running on the endpoints of our customers.
Over the years we’ve developed multiple dedicated infrastructures and feature suites serving this methodology. These include deployment automation with real-time failure checks between deployment phases, and features that give admins full control of how they manage updates of Cato Sockets and Cato Clients inside their organization.
Graduality allows admins to roll out updates at a pace that meets the parameters each IT organization sets for itself. It provides the necessary time between every phase and update group so that if something goes wrong, there is time to discover it and reduce the impact radius.
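To make the idea of failure checks between phases concrete, here is a minimal Python sketch of a phased rollout that halts as soon as a phase looks unhealthy. It is an illustration only, not Cato's implementation: the function name `gradual_rollout`, the `deploy` and `is_healthy` callbacks and the `max_failure_rate` threshold are all hypothetical names chosen for this example.

```python
from typing import Callable, List, Sequence


def gradual_rollout(
    phases: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Deploy phase by phase and stop once a phase exceeds the failure budget.

    All names here are hypothetical. Returns the targets that actually
    received the update, so the caller can see the impact radius.
    """
    updated: List[str] = []
    for phase in phases:
        # Push the update to every target in the current phase only.
        for target in phase:
            deploy(target)
            updated.append(target)

        # Check the health of this phase before moving to the next one.
        failures = sum(1 for target in phase if not is_healthy(target))
        if phase and failures / len(phase) > max_failure_rate:
            # Halt here: later phases are never touched.
            break
    return updated


if __name__ == "__main__":
    phases = [["pop-1"], ["pop-2", "pop-3"], ["pop-4", "pop-5", "pop-6"]]
    reached = gradual_rollout(
        phases,
        deploy=lambda target: None,                  # stand-in for a real deployment step
        is_healthy=lambda target: target != "pop-3",  # simulate one bad target
    )
    print(reached)
```

In this toy run the rollout stops after the second phase, so the three targets in the last and largest phase never receive the faulty update.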
Cato Client Gradual Rollout – Client Upgrade Policy
For comparison, we will highlight the way Cato manages updates to its Cato Client, which, like the CrowdStrike agent, is installed on all workstations of the organization.
When a new client version is approved for release, following its extensive automation and regression testing, it goes into a release pipeline that is managed from start to finish. New client versions are distributed gradually between groups of customers and are never made available to all the groups at once.
It is worth mentioning that Cato employs “dogfooding”: the very first clients to be upgraded are the Cato Clients managed by Cato’s own IT department, using the same tools and methods as our customers, as a final gate of quality control.
Within the scope of a specific customer for which an update has been made available, the IT administrator controls how the client is published to the users within the organization using the Client Upgrade Policy.
The Client Upgrade Policy is a native graduality mechanism that the admin uses to control the pace of Client upgrades, with the granularity to run different rollouts per endpoint platform. Initially, a “Pilot Group” of users receives the update. These are typically IT members and other early adopters who can identify and report any issues first.
After the Pilot Group, the client update continues to roll out gradually to the rest of the install base. The administrator can track the progress in the CMA and pause the update at any moment if required.
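The flow described above (pilot group first, then gradual waves, with a pause switch) can be modeled in a few lines of Python. This is a hedged sketch and not the Client Upgrade Policy's actual data model or API: the class `ClientUpgradeRollout`, its fields and the `batch_size` parameter are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ClientUpgradeRollout:
    """Toy model of a pilot-first, pausable client rollout (hypothetical names)."""

    users: List[str]               # every user in the organization
    pilot_group: List[str]         # early adopters who are upgraded first
    batch_size: int = 2            # users per wave after the pilot (assumed knob)
    paused: bool = False           # admin can flip this at any time
    upgraded: Set[str] = field(default_factory=set)

    def next_wave(self) -> List[str]:
        """Return the users who would be upgraded next, without upgrading them."""
        pilot_pending = [u for u in self.pilot_group if u not in self.upgraded]
        if pilot_pending:
            return pilot_pending   # the whole pilot group goes first
        remaining = [u for u in self.users if u not in self.upgraded]
        return remaining[: self.batch_size]

    def advance(self) -> List[str]:
        """Upgrade the next wave unless the rollout is paused."""
        if self.paused:
            return []
        wave = self.next_wave()
        self.upgraded.update(wave)
        return wave


if __name__ == "__main__":
    rollout = ClientUpgradeRollout(
        users=["it-1", "it-2", "user-1", "user-2", "user-3"],
        pilot_group=["it-1", "it-2"],
    )
    print(rollout.advance())   # pilot group
    print(rollout.advance())   # first regular wave
    rollout.paused = True      # admin pauses after spotting an issue
    print(rollout.advance())   # nothing is upgraded while paused
```

The point of the sketch is that pausing is checked before every wave, so an issue reported by the pilot group or an early wave stops the update from reaching the remaining users.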
Summary
This recent global outage highlights the critical need for robust deployment practices. At Cato Networks, our [quite overzealous] commitment to gradual deployment models ensures that any changes or updates to our cloud services and endpoints are meticulously controlled and monitored.
By deploying updates in phases and giving IT teams the tools and fine-tuned control over Client updates, we minimize the risk of widespread disruptions and provide ample time to detect and address issues early. This approach not only enhances the reliability of our services but also gives our customers confidence in the stability of their IT operations.