We’ve spoken a lot about the capabilities of SASE, like Zero Trust and security in general, but features mean little if there’s a SASE outage. Today, all SASE vendors provide high availability (HA) SD-WAN device configurations, but, of course, SASE networks involve more than just SD-WAN devices. They involve a full range of security services, connect mobile users and cloud resources, and often involve alternative backbones. With each additional element, the risk of outages only grows. You must, I repeat, must expand your HA considerations to include these other elements and carefully consider how the overall platform behaves when facing a wider range of fault conditions.
Cloud, Cloud, Everywhere
While Gartner expects SASE platforms to ultimately run security and networking on cloud-native platforms, this process is likely to take several years. Today, many SASE vendors converge networking and security into appliances or deliver multi-vendor SASE where they couple their SD-WAN devices with third-party security solutions. (They all should show a migration path to a fully converged, cloud-native solution, as we highlight here.)
But make no mistake: With all SASE players, some aspects of their platform will rely on the cloud or, more precisely, exist outside enterprise-controlled or enterprise-owned locations. It’s a question of degree. At its most basic, it’s Internet connectivity and, when predictable global connectivity is needed, it’s going to be a private backbone provided by the provider or third parties.
You’ll need a virtual node to connect IaaS and SaaS into your SASE deployment. This will be a virtual appliance you deploy near the cloud instance, typically on shared cloud infrastructure, or as one tenant of a multi-tenant cloud gateway. Virtual appliances may also be run in the cloud to enforce security policies or provide management. And at the extreme, all significant packet processing and security enforcement will happen on shared infrastructure.
Resiliency and the Cloud
As SASE deployments move components into the cloud, here are some of the points we investigate with our clients:
At the level of transport, redundancy is already built into the Internet, but the alternative private backbone can be different. Consider what happens if there’s an outage on a path. Does the alternative backbone have multipathing to route around the outage? Many times an outright outage is the easiest failure to detect. Of more interest is what happens when latency or loss starts to spike. Will the backbone detect the change and dynamically direct traffic along an alternate path?
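To make the latency-and-loss question concrete, here is a minimal Python sketch of the kind of decision a backbone would have to make. Everything in it is a hypothetical illustration, not any vendor's implementation: the `PathStats` type, the `choose_path` function, and the 150 ms and 1% thresholds are assumptions chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass
class PathStats:
    """Recent probe results for one backbone path (hypothetical structure)."""
    name: str
    latency_ms: list[float]  # recent round-trip times from probes
    loss_pct: float          # recent probe loss, in percent
    up: bool = True          # False models an outright outage


def is_healthy(path: PathStats, max_latency_ms: float, max_loss_pct: float) -> bool:
    # A path with no probe data is treated as unhealthy, same as a hard outage.
    if not path.up or not path.latency_ms:
        return False
    return mean(path.latency_ms) <= max_latency_ms and path.loss_pct <= max_loss_pct


def choose_path(current: PathStats, candidates: list[PathStats],
                max_latency_ms: float = 150.0,   # assumed threshold
                max_loss_pct: float = 1.0) -> PathStats:  # assumed threshold
    """Stay on the current path while it is healthy; otherwise move to the
    healthy alternative with the lowest mean latency. If nothing is healthy,
    stay put rather than flap between bad paths."""
    if is_healthy(current, max_latency_ms, max_loss_pct):
        return current
    healthy = [p for p in candidates
               if p is not current and is_healthy(p, max_latency_ms, max_loss_pct)]
    if not healthy:
        return current
    return min(healthy, key=lambda p: mean(p.latency_ms))
```

The point to notice is that the degraded path in this model is still "up." A check that only looks for hard outages would never move traffic off it, which is exactly the behavior worth asking a provider about.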
As you research SASE, explore how the network behaves if an SD-WAN appliance or remote user should lose connectivity to a PoP. This may be because of an outage in the PoP, a software failure, or simply a loss of connectivity. Do sessions fail over to another component in the PoP? If the full PoP fails, do sessions fail over to another PoP? Also, look at what happens to session state. Is it maintained during the failover? Is the interruption brief enough that the session continues to operate unnoticed by the user?
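The two failover questions above (another component in the same PoP first, then another PoP) can be expressed as a simple selection rule. This is a hedged sketch only: the `Node` type, the `pick_failover_target` function, and the PoP and node names are hypothetical, and a real platform would also have to deal with session state, which this example ignores.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One processing component inside a PoP (hypothetical model)."""
    pop: str        # PoP the node lives in
    name: str
    healthy: bool
    rtt_ms: float   # round-trip time from the site or user to this node


def pick_failover_target(failed: Node, nodes: list[Node]) -> Optional[Node]:
    """Prefer a healthy node in the same PoP as the failed one; if the whole
    PoP is gone, fall back to the closest healthy node in another PoP."""
    candidates = [n for n in nodes if n.healthy and n != failed]
    if not candidates:
        return None
    # False sorts before True, so same-PoP nodes win; RTT breaks ties.
    return min(candidates, key=lambda n: (n.pop != failed.pop, n.rtt_ms))
```

In this model a same-PoP node is chosen even when a node in another PoP is closer, on the assumption that staying inside the PoP is less disruptive. Whether that assumption holds for a given provider is one of the things to ask.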
Virtual Appliance Resiliency
Virtual appliances running in the cloud effectively act as PoPs, and many of the same issues apply. Because more processing may be done in the virtual appliance, additional questions apply. For example, if the virtual appliances are providing security inspection and policy enforcement, what happens in the event of an outage? Also, how long does it take to restart the appliance in the event of a failover? AWS and Azure build redundancy into their cloud components, but outages do occur, as we saw right before Thanksgiving. Which of those cloud services do the SASE providers rely on, and how will they behave in the event of a failure?
Every SD-WAN has a controller where policies are instantiated and pushed out to the SD-WAN devices for enforcement. Redundancy in the control plane is vital to the ongoing management and control of the SD-WAN or SASE. If a controller or a specific instance fails, it should have hot failover to backup controllers. If the control function is distributed across nodes, then other nodes should be able to step in.
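As a rough illustration of hot failover between controllers, the sketch below picks the highest-priority controller whose heartbeat is still fresh. It is an assumption-laden model, not a description of how any particular SD-WAN controller works: the `active_controller` function, the controller names, and the three-second timeout are all hypothetical.

```python
from __future__ import annotations

from typing import Optional


def active_controller(priority_order: list[str],
                      last_heartbeat: dict[str, float],
                      now: float,
                      timeout_s: float = 3.0) -> Optional[str]:  # assumed timeout
    """Return the first controller, in priority order, that has sent a
    heartbeat within the last `timeout_s` seconds, or None if none has."""
    for name in priority_order:
        seen = last_heartbeat.get(name)
        if seen is not None and now - seen <= timeout_s:
            return name
    return None


# Example: the primary has been silent for 10 seconds, so the first backup takes over.
controllers = ["ctl-primary", "ctl-backup-1", "ctl-backup-2"]
heartbeats = {"ctl-primary": 100.0, "ctl-backup-1": 109.5, "ctl-backup-2": 109.0}
chosen = active_controller(controllers, heartbeats, now=110.0)
```

The timeout is the interesting knob: it bounds how long the network runs without a controller after a failure, so it is worth asking what the equivalent value is in the platform you are evaluating.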
The most vulnerable part of your infrastructure will likely be the last-mile access. This is where most outages and disruptions occur. Companies spend a lot of time (or should spend a lot of time) building in HA and resilience to prevent last-mile outages from disrupting the location. They add appliances and put in redundant, dual-homed last-mile connections.
SD-WAN vendors will tell you that failover between HA appliances can be near-instantaneous, and users won’t miss a beat. This is partially true. When your DHCP source is the ISP, as is the case with most DSL connections, switching between providers can flip the IP address, which will disrupt most applications. It’s why some vendors assign their device IPs from their own infrastructure. Again, something to consider.
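Why does a flipped IP address hurt so much? A TCP connection is identified by its source and destination addresses and ports, so traffic arriving from a new source address does not match any connection the server already knows about. The Python sketch below models only that lookup. The `FlowKey` type and `server_recognizes` function are hypothetical, and the addresses come from the ranges reserved for documentation.

```python
from __future__ import annotations

from typing import NamedTuple


class FlowKey(NamedTuple):
    """The four values that identify a TCP connection."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def server_recognizes(established: set[FlowKey], packet: FlowKey) -> bool:
    # The server only continues a session whose full key it already holds.
    return packet in established


# Session set up over ISP A, whose DHCP handed the site 198.51.100.10.
before = FlowKey("198.51.100.10", 40000, "203.0.113.5", 443)
established = {before}

# After failover to ISP B the site has a different public address.
after_isp_flip = before._replace(src_ip="192.0.2.77")

# If the vendor assigns the address from its own infrastructure, it stays the same.
after_stable_ip = before

flip_ok = server_recognizes(established, after_isp_flip)     # False: session breaks
stable_ok = server_recognizes(established, after_stable_ip)  # True: session continues
```

This is a simplification that ignores application-level reconnects and protocols designed to survive address changes, but it shows why the source of the device IP matters during failover.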
Management and Monitoring
Finally, take a look at the “wrapping” around the service: how outages and uptime are reported. Your SASE provider should be able to report all fault conditions and events in a single console, regardless of where those events happen across the SASE network.
Resilience is Critical: Here’s What You Can Do Next
Assessing resilience in network architectures is never easy, doubly so when what’s being assessed isn’t in front of you but in the cloud or a service. These are some of the points we consider when working with our clients, but there are many more. If you’d like additional insight or help with your assessment, sign up for our SASE Jumpstart Kit, where SASE resilience is a key part of what we cover.