How Cato Cloud Resiliency Overcomes Regional and National Outages

Just a day before Thanksgiving, an AWS cloud outage took down large parts of the Internet for multiple hours, impacting major apps, websites, and services worldwide, including Autodesk, Roku, and Shipt. Although only one of AWS's 23 geographic regions (US-East-1) experienced issues at the time, the global echo was significant for any company dependent on AWS cloud services.

It’s incredibly important to look “under the covers” of all cloud-based offerings, especially those claiming to be SASE services. Simply spinning up a virtual appliance in the cloud or hosting physical appliances and calling it a “cloud-based service” is a far cry from providing an enterprise-grade service that’s designed to work 24x7x365. What happens when the appliance fails? How does the cloud-hosted appliance deal with failures in the cloud provider’s infrastructure? If SASE is to become the networking and security solution, it must be enterprise-grade. This is very much a case where architecture matters.

Cato Cloud: A Self-Healing Architecture

Cato has spent years developing a cloud-native, self-healing platform that can recover from failures at all levels of its architecture. Today, Cato runs a stateless, single-pass cloud-native engine that handles the routing, optimizing, and securing of all WAN and Internet traffic.

Processing is distributed across a cloud-scale, global network of points of presence (PoPs). Controller functionality is built into a smart, distributed data plane at the processing-engine level rather than placed in a single controller, which eliminates a potential single point of failure. With most processing in the cloud, edge devices and clients accessing Cato are radically simplified, further reducing the likelihood of edge outages.

Every Cato Cloud tunnel and resource has automated failover capabilities inside the PoP, across PoPs, and across the entire cloud, for a fully self-healing architecture.

Self-Healing of the Cloud Network

Rather than relying on the unpredictable global Internet, Cato Cloud is built on our global private backbone. It’s a geographically distributed, SLA-backed network of 60+ PoPs, interconnected by multiple tier-1 carriers. This cloud network is engineered to deliver predictable transport with zero packet loss, minimum latency, and global optimization for maximum performance.

Self-Healing Between PoPs

Upon a failure or degradation in a tier-1 carrier connecting to a Cato PoP, any PoP can automatically switch to an alternate tier-1 carrier in the global backbone to maintain Internet access. If needed, PoPs will connect to the nearest Internet Exchange (IX) for enhanced redundancy.
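The carrier switch described above can be illustrated with a minimal sketch. It is not Cato's implementation: the carrier names, the loss and latency thresholds, and the `pick_carrier` function are all hypothetical, chosen only to show the shape of the decision (stay on a healthy carrier, otherwise move to the best healthy alternate, otherwise fall back to an Internet Exchange).

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds; real values would come from SLA targets.
MAX_LOSS_PCT = 1.0
MAX_LATENCY_MS = 150.0
IX_FALLBACK = "internet-exchange"  # hypothetical label for the IX path


@dataclass
class CarrierHealth:
    name: str
    reachable: bool
    loss_pct: float
    latency_ms: float


def is_healthy(c: CarrierHealth) -> bool:
    """A carrier is usable if it is reachable and within both thresholds."""
    return (c.reachable
            and c.loss_pct <= MAX_LOSS_PCT
            and c.latency_ms <= MAX_LATENCY_MS)


def pick_carrier(current: str, carriers: List[CarrierHealth]) -> str:
    """Keep the current carrier while healthy; otherwise switch to the
    healthy carrier with the lowest latency; otherwise use the IX."""
    by_name = {c.name: c for c in carriers}
    cur = by_name.get(current)
    if cur is not None and is_healthy(cur):
        return current  # avoid needless switching
    healthy = [c for c in carriers if is_healthy(c)]
    if healthy:
        return min(healthy, key=lambda c: (c.latency_ms, c.loss_pct)).name
    return IX_FALLBACK


carriers = [
    CarrierHealth("carrier-a", True, 5.0, 40.0),   # degraded: high loss
    CarrierHealth("carrier-b", True, 0.1, 60.0),
    CarrierHealth("carrier-c", True, 0.0, 55.0),
]
selected = pick_carrier("carrier-a", carriers)
```

Keeping the current carrier while it is healthy is a deliberate choice in this sketch: it avoids flapping between carriers whose measurements are close.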

If any PoP becomes unreachable or is disrupted due to maintenance, all tunnels connected to that PoP automatically move to the nearest available PoP. The automatic decision-making weighs special failover rules, regulatory constraints, and other tradeoffs. IP ranges associated with failed PoPs are also moved to ensure service continuity.
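As a rough illustration of PoP failover under policy constraints, the sketch below picks the nearest available PoP from a permitted set of regions. Everything here is an assumption made for the example: the `PoP` record, the region labels, the `allowed_regions` rule, and the use of great-circle distance as a stand-in for "nearest" (a production system would more plausibly measure network latency).

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class PoP:
    name: str
    lat: float
    lon: float
    region: str            # hypothetical regulatory region label
    available: bool = True


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def failover_target(site_lat: float, site_lon: float, pops: Iterable[PoP],
                    allowed_regions: Optional[Set[str]] = None) -> Optional[PoP]:
    """Return the nearest available PoP, optionally restricted to regions
    the site is allowed to use. Returns None if no PoP qualifies."""
    candidates = [
        p for p in pops
        if p.available and (allowed_regions is None or p.region in allowed_regions)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda p: haversine_km(site_lat, site_lon, p.lat, p.lon))


pops = [
    PoP("paris", 48.86, 2.35, "EU", available=False),  # the failed PoP
    PoP("london", 51.51, -0.13, "UK"),
    PoP("frankfurt", 50.11, 8.68, "EU"),
    PoP("new-york", 40.71, -74.01, "US"),
]
unconstrained = failover_target(48.86, 2.35, pops)
eu_only = failover_target(48.86, 2.35, pops, allowed_regions={"EU"})
```

For a site in Paris, the unconstrained choice is London, while a rule that keeps traffic in the EU selects Frankfurt instead, which is the kind of tradeoff the paragraph above describes.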

Self-Healing Within PoPs

All of Cato’s PoPs contain redundant servers, each running identical copies of Cato’s software. Each of these compute nodes can serve any edge tunnel connected to the PoP. If a compute node fails, the disconnected tunnels will reconnect to an available compute node inside the same PoP, as it remains the closest PoP to the disconnected edges. In most cases, user sessions will not be affected.
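Because any compute node can serve any tunnel, failover within a PoP reduces to reassigning the failed node's tunnels to the surviving nodes. The sketch below shows one way that could look, assuming a simple least-loaded policy; the node and tunnel names and the `reassign_tunnels` function are hypothetical and do not describe Cato's actual scheduler.

```python
from typing import Dict, List


def reassign_tunnels(assignments: Dict[str, str], failed_node: str,
                     nodes: List[str]) -> Dict[str, str]:
    """Return a new tunnel->node mapping with the failed node's tunnels
    moved to the least-loaded surviving nodes in the same PoP."""
    survivors = [n for n in nodes if n != failed_node]
    if not survivors:
        # Nothing left in this PoP: the caller would fail over to another PoP.
        raise RuntimeError("no compute node available in this PoP")

    # Count the tunnels each surviving node already serves.
    load = {n: 0 for n in survivors}
    for node in assignments.values():
        if node in load:
            load[node] += 1

    updated = dict(assignments)
    for tunnel, node in sorted(assignments.items()):
        if node == failed_node:
            # Ties are broken by node name to keep the result deterministic.
            target = min(survivors, key=lambda n: (load[n], n))
            updated[tunnel] = target
            load[target] += 1
    return updated


before = {"t1": "node-a", "t2": "node-a", "t3": "node-b", "t4": "node-c"}
after = reassign_tunnels(before, "node-a", ["node-a", "node-b", "node-c"])
```

Tunnels on healthy nodes are left untouched, so only the sessions that were on the failed node need to reconnect.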

Overall Self-Healing

In the unlikely event that Cato Cloud is lost entirely, Cato Sockets can establish direct connectivity over the public Internet to maintain branch and Internet connectivity, though without Cato's security services or backbone.

Self-Healing at the Edge


Cato edge appliances are thin-edge SD-WAN devices with just enough logic to move traffic into Cato Cloud for networking and security processing. The thin-edge design makes redundant devices affordable. Cato also provides Sockets with redundant components.

Several high availability (HA) branch design options are available:

  • Affordable cold spares with automatic provisioning in the cloud,
  • Warm standby for automatic takeover as part of the self-healing architecture, and
  • Transport overlay across multiple last-mile transports in either active/passive or active/active configurations.

Sites automatically reconnect to the optimum PoP upon any outage or degradation. In addition, if Cato Cloud is temporarily unreachable for any reason, branches communicate directly with one another and automatically reconnect to Cato Cloud once it is available again.
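The edge behavior above amounts to a small decision: use the best reachable PoP, and fall back to direct branch-to-branch connectivity only when no PoP responds. The following sketch is illustrative only. The probe results, the `"pop:"` and `"direct"` labels, and the use of latency as the measure of "optimum" are assumptions for the example.

```python
from typing import Dict, Optional


def choose_path(pop_latency_ms: Dict[str, Optional[float]]) -> str:
    """Pick the reachable PoP with the lowest probe latency.

    A latency of None means the probe to that PoP failed. If no PoP is
    reachable, fall back to direct connectivity over the public Internet.
    """
    reachable = {name: ms for name, ms in pop_latency_ms.items() if ms is not None}
    if reachable:
        return "pop:" + min(reachable, key=reachable.get)
    return "direct"


# Normal operation: both PoPs answer, the faster one wins.
normal = choose_path({"pop-a": 30.0, "pop-b": 12.5})
# Cloud unreachable: every probe failed, so the site goes direct.
outage = choose_path({"pop-a": None, "pop-b": None})
```

Re-running the same function on each probe cycle gives the automatic return to the cloud: as soon as any PoP answers again, the result switches from `"direct"` back to a PoP.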

Remote Users

The same seamless HA is available for remote users. If a remote user’s device loses tunnel connectivity or the user roams, Cato Clients automatically reconnect to the nearest PoP, using dynamic tunnel failover inside a PoP or across PoPs so that all services continue.

Built-in Self-Healing for Peace of Mind

As the recent AWS outage reminds us, running in the public cloud, for all its reliability, does not by itself guarantee uptime. In today’s cloud-first digital world, fragmented networking point solutions add HA complexity and cost.

With Cato’s self-healing architecture, all failure detection, failover, and fallback are automatic, with no need to manually update networking, security, or optimization policies. Cato’s cloud-native SASE platform enables global enterprises to meet or even surpass uptime requirements with the best mix of cost, resilience, and enterprise-grade redundancy. It is more dependable than the unpredictable public Internet and more affordable than global MPLS and other legacy backbones.

Read more about how Cato helps global and regional enterprises in digital transformation
