When the Cloud Goes Dark: Why Owning Your Infrastructure Matters for Critical Services
On June 12, 2025, a global outage at Google Cloud Platform (GCP) brought critical infrastructure to a halt. The ripple effects were immediate. Services from Palo Alto Networks and Cloudflare, both of which rely on GCP, experienced outages that lasted hours. Enterprises depending on these services were left blind and exposed.

This wasn't a first. It won't be the last. But it was a wake-up call.

When SASE, SSE, or SD-WAN platforms go down, the business is down. Productivity stalls. Security gaps open. Risk skyrockets. For services this critical, high availability isn't a nice-to-have; it's non-negotiable. And it starts with architectural ownership: building on infrastructure you control.
The Illusion of Cloud Redundancy
Hyperscalers like GCP, AWS, and Azure offer impressive scale and global reach. But they weren't designed for networking and security services that must deliver real-time performance and five-nines uptime. They're general-purpose clouds, built to run apps, not to carry the enterprise's critical network and security infrastructure.

Vendors who build their services on hyperscalers inherit this fragility. When GCP stumbles, everything on top of it stumbles too. Redundancy within a single cloud provider won't save you from a systemic failure like the one on June 12. And cross-cloud resilience? That only works if it's designed and tested from the ground up, not patched in after the fact.
Critical Infrastructure Demands a Different Kind of Cloud
The Cato SASE Cloud Platform is different by design. We built our own global backbone, with over 85 PoPs running Cato-owned, bare-metal infrastructure in top-tier data centers. We don't rent resilience; we engineer it. Each PoP runs a fully redundant stack, with our Single Pass Cloud Engine (SPACE) inspecting and enforcing policy on all traffic. When a SPACE, compute node, or PoP fails, traffic shifts automatically: no intervention needed, no service disruption.

This isn't theory. It's been tested repeatedly:
– When a fiber cut disrupted a major carrier, users didn't notice; our backbone rerouted traffic instantly.
– When a UK data center went down, Cato rerouted around it with zero downtime.
– When power outages hit Spain and Portugal, Cato customers stayed connected and protected.

These weren't flukes. They were outcomes of deliberate design.
It's Not Just About Owning the Metal, It's About the Architecture
Resiliency isn't achieved by throwing more hardware at the problem. It's about eliminating single points of failure across compute, connectivity, inspection, and control.

Cato's architecture distributes inspection across thousands of SPACEs running in parallel. Each can process encrypted traffic at line rate, with full security stack enforcement. If one fails, traffic moves to the next SPACE. If a PoP fails, tunnels automatically reroute to the closest healthy PoP. There's no need to reconfigure, no waiting for DNS propagation, and no failover scripts.

Compare that to appliance-based or cloud-hosted architectures, where firewalls, SD-WAN edges, or inspection points are bound to a specific instance. Failover requires planning, testing, and manual intervention, often during a crisis.
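The PoP failover described above, rerouting a tunnel to the closest healthy PoP with no reconfiguration, can be sketched in a few lines. This is a hedged illustration only: the `Pop` class and `select_pop` function are hypothetical names for the sake of the example, not Cato's actual implementation or API.

```python
# Illustrative sketch of closest-healthy-PoP selection (not Cato code).
from dataclasses import dataclass

@dataclass
class Pop:
    name: str
    latency_ms: float   # measured round-trip time from the edge to this PoP
    healthy: bool       # result of continuous health probes

def select_pop(pops: list[Pop]) -> Pop:
    """Pick the lowest-latency PoP that is currently passing health checks."""
    candidates = [p for p in pops if p.healthy]
    if not candidates:
        raise RuntimeError("no healthy PoP available")
    return min(candidates, key=lambda p: p.latency_ms)

pops = [
    Pop("london", 12.0, healthy=False),    # simulated PoP outage
    Pop("frankfurt", 19.0, healthy=True),
    Pop("madrid", 31.0, healthy=True),
]
print(select_pop(pops).name)  # frankfurt: the closest PoP that is healthy
```

The point of the sketch is that failover is a pure selection over live health data; there is nothing to reconfigure and no state to migrate when a PoP drops out of the candidate set.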
Service Uptime Starts with Data Plane Ownership
When delivering critical infrastructure like SASE, availability and resiliency must be engineered into the data plane itself. This means more than aiming for generic uptime metrics. It requires delivering a true 99.999% availability SLA at the data plane layerβwhere user sessions are inspected, secured, and routed. If that layer goes down, the enterprise is either offline or dangerously exposed.
To meet this SLA, every component participating in the data plane must be designed for automatic recovery. Any dependency, be it a physical device, a third-party service, or a cloud provider, that lacks redundancy and fast failover mechanisms becomes a single point of failure.
In practice, this means one thing: you have to own the infrastructure. Only then can you design, control, and continuously improve its resilience. When third-party services must be used, they must be architected as stateless, easily replaceable, and monitored components that can fail without impact. Recovery must be automatic, not reactive.
If you don't control the infrastructure, you must at least control how it fails, and how quickly it recovers.
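Treating a third-party dependency as stateless and replaceable, as described above, boils down to trying interchangeable replicas until one succeeds, so recovery is automatic rather than reactive. The following is a minimal sketch under that assumption; the function and replica names are hypothetical, not any vendor's API.

```python
# Illustrative sketch (not Cato code): stateless failover across
# interchangeable replicas of the same third-party service.
def call_with_failover(replicas, request):
    """Try each replica in turn; the first success wins."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:   # in production: narrower, timeout-bounded errors
            errors.append(exc)     # record and move on: recovery is automatic
    raise RuntimeError(f"all replicas failed: {errors}")

# Two interchangeable, stateless endpoints for the same service.
def primary(req):
    raise TimeoutError("primary unreachable")  # simulated outage

def secondary(req):
    return f"handled:{req}"

print(call_with_failover([primary, secondary], "dns-lookup"))  # handled:dns-lookup
```

Because no replica holds session state, any of them can fail mid-request without impact; the caller simply lands on the next one.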
What Enterprises Should Demand from Their SASE Vendor
If your SASE vendor can't survive a cloud provider outage, they haven't built a SASE cloud. They've rented one. And that's not enough.

Ask your vendor:
– Where is your control plane hosted, and what happens if it's down?
– What happens if a PoP fails mid-session?
– How fast is the failover between providers or regions?
– Are your inspection engines multi-tenant, self-healing, and load-balanced?

And most importantly: Have you proven it?
At Cato, we have. Not once, but many times.