A customer of mine and I were chatting over lunch the other day, and our conversation turned to what was going on with his network. His routers were approaching end-of-life. I asked him what his plans were, and he looked at me with a blank stare and said: “Upgrade them, of course.” I do not think it is ‘of course’ at all.

For years, we dutifully upgraded our routers because, well, there wasn’t much of a choice. Routers are not just fundamental to enterprise networks; they are enterprise networks. We have needed them to connect our offices, stores, and warehouses.

However, increasingly, I am finding routing to be inadequate. It’s complex. It fails to account for differences between applications. All of which and more tells me it’s time for businesses to route away from their routers and shift to SD-WANs.

A simple statement, I know, but to understand why, you will need to read a bit more. Warning: this post is longer than my usual posts, so you may want to jump around.

What’s Routing Anyway?

Fundamentally, the routing protocols used in most businesses today were designed to solve the endpoint reachability problem in large networks. Scale and convergence times were the attributes that drove companies in the ‘90s to abandon RIP and look at link-state protocols, namely OSPF.

OSPF routers advertise the state of links to adjoining routers. Those advertisements are forwarded to the other routers in the network, giving every router the raw data for building a complete route table. The routers determine the shortest path by adding up the path costs for each segment – the lower the path cost, the shorter the path. Practically, path cost calculations in most businesses are associated with link bandwidth. The routers run the Dijkstra algorithm on that data to determine the shortest path between two points, which becomes the route.
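To make that concrete, here is a minimal sketch of the calculation, not any vendor's implementation. It assumes a cost of reference bandwidth divided by link bandwidth with a 100 Mbps reference (a common default, but configurable on real routers), and the site names and link speeds are made up for illustration.

```python
import heapq

REFERENCE_BW_MBPS = 100  # assumed reference bandwidth; real routers let you change it


def link_cost(bandwidth_mbps):
    """Bandwidth-derived cost: faster links get a lower cost, never below 1."""
    return max(1, int(REFERENCE_BW_MBPS / bandwidth_mbps))


def shortest_path(links, source, target):
    """Run Dijkstra over links given as {(a, b): bandwidth_mbps}, treated as bidirectional."""
    graph = {}
    for (a, b), bandwidth in links.items():
        cost = link_cost(bandwidth)
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    best = {source: 0}      # lowest known path cost to each node
    previous = {}           # back-pointers for rebuilding the route
    queue = [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, cost in graph.get(node, []):
            new_dist = dist + cost
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                previous[neighbor] = node
                heapq.heappush(queue, (new_dist, neighbor))

    if target not in best:
        return None, []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]


# Hypothetical topology: a slow direct link and a faster two-hop route.
LINKS = {
    ("HQ", "Branch"): 10,   # cost 10
    ("HQ", "DC"): 100,      # cost 1
    ("DC", "Branch"): 100,  # cost 1
}

cost, route = shortest_path(LINKS, "HQ", "Branch")
print(cost, route)
```

Note that the calculation yields one answer per destination. Nothing in it knows whether the packet belongs to a voice call or a backup job.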

Routing is Broken

We all know business requirements have shifted enormously since the last century. Routing hasn’t kept up. Voice and other real-time applications have joined the mix of applications crossing the enterprise. Each of these has very different characteristics. The best path for VoIP might be the one with the least jitter, but the best path for backup, for example, might be the least expensive one with the most bandwidth.

OSPF, though, doesn’t see the differences between applications. It calculates one path cost for all applications. Even then, basing cost calculations on bandwidth is an approximation of performance at best. Over long distances, application performance is more likely to be determined by latency and loss than by bandwidth.

The average enterprise network today is also far larger than in the ‘90s, and that presents problems particularly for real-time protocols.  Convergence times are fast, but not fast enough (sub-second) to maintain a VoIP session between two locations.

Internet access, an option in the ‘90s, is a requirement for corporate networks today. Many companies continue to direct cloud and Internet traffic through a central Internet portal. This can lead to poor performance due to the “trombone” effect. Architecting local Internet connections can be done, but it requires thinking through asymmetric routing and the out-of-state flows it causes when traffic crosses firewall inspection boundaries.

Application protocols today may be fewer than in the ‘90s, but only because applications migrated higher up in the stack, running above HTTP. Legacy traffic shaping and load balancing equipment did not distinguish between applications above port 80. Even where legacy equipment can differentiate between HTTP applications, it still treats all application sessions the same. Clearly, though, a Skype video call and a file transfer, for example, have very, very different requirements.

Operationally, we all know about the downsizing and IT reductions. IT staffing declined by 10% between 2010 and 2014. Back in the ‘90s, branch offices were often supported by a local IT person. Today, most companies that I speak with have eliminated or are looking to reduce on-site IT support for branch offices. All of which makes usability, remote management and rapid deployment hugely important.

However, today’s protocol stack remains complex. An enormous range of protocols and options need to be mastered to make IP networking work.

OSPF is just one part, but we know that VPNs, QoS, ACLs, and more also need configuration. Routing has become an art form, one that most companies would like to avoid.

SD-WANs: Internetworking Made Simpler

SD-WANs, I’m finding, can meet our routing needs with characteristics more suitable for today’s companies. With one device, we can consolidate much of the functionality at the edge: the router, the load balancer, the WAN optimizer, and the VPN. Configuration is much simpler and faster, particularly when you consider that you can fold Internet service, where you control bandwidth provisioning, into the corporate WAN.

SD-WANs also let you improve application performance. Most SD-WANs allow you to define traffic profiles where you can stipulate the maximum and minimum tolerance for loss, latency, jitter, and bandwidth. Once policies are defined, they are pushed out to the SD-WAN appliance at each location, which uses the policies, along with real-time path performance metrics and deep packet inspection (DPI) technology for identifying applications, to direct traffic onto the optimum path. Voice can be sent down paths with the least jitter and loss; cloud traffic can avoid the trombone effect and be sent to the secure Internet portal closest to the cloud destination.
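Here is a rough sketch of what a traffic profile and the resulting path choice might look like. Every name in it (PathMetrics, TrafficProfile, select_path) and every number is hypothetical; real products expose this through their own policy languages, and the tie-breaking rule here is only one plausible choice.

```python
from dataclasses import dataclass


@dataclass
class PathMetrics:
    """Live measurements for one path (hypothetical fields and units)."""
    name: str
    latency_ms: float
    jitter_ms: float
    loss_pct: float
    bandwidth_mbps: float
    cost_per_mbps: float


@dataclass
class TrafficProfile:
    """Tolerances for one application group; defaults mean no constraint."""
    app_group: str
    max_latency_ms: float = float("inf")
    max_jitter_ms: float = float("inf")
    max_loss_pct: float = float("inf")
    min_bandwidth_mbps: float = 0.0
    prefer: str = "jitter_ms"  # metric to minimize among compliant paths


def meets(profile, path):
    """True when the path satisfies every tolerance in the profile."""
    return (
        path.latency_ms <= profile.max_latency_ms
        and path.jitter_ms <= profile.max_jitter_ms
        and path.loss_pct <= profile.max_loss_pct
        and path.bandwidth_mbps >= profile.min_bandwidth_mbps
    )


def select_path(profile, paths):
    """Pick the compliant path with the lowest preferred metric.

    If no path is compliant, fall back to the best of what is available.
    """
    compliant = [p for p in paths if meets(profile, p)]
    candidates = compliant or list(paths)
    return min(candidates, key=lambda p: getattr(p, profile.prefer))


# Made-up measurements for two circuits at one site.
PATHS = [
    PathMetrics("mpls", latency_ms=40, jitter_ms=2, loss_pct=0.01,
                bandwidth_mbps=10, cost_per_mbps=100.0),
    PathMetrics("internet", latency_ms=55, jitter_ms=12, loss_pct=0.5,
                bandwidth_mbps=100, cost_per_mbps=10.0),
]

VOICE = TrafficProfile("voice", max_latency_ms=150, max_jitter_ms=30,
                       max_loss_pct=1.0, prefer="jitter_ms")
BACKUP = TrafficProfile("backup", min_bandwidth_mbps=50, prefer="cost_per_mbps")

print(select_path(VOICE, PATHS).name, select_path(BACKUP, PATHS).name)
```

With these numbers, voice lands on the low-jitter circuit and backup on the cheap, fat one, which is exactly the per-application distinction a single path cost cannot express.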

As for costs, the customer I mentioned would have spent about 30% more in capital costs, but would have recouped his costs within year one by using Internet bandwidth. Internet transit costs have declined 99.95% since OSPF broke onto the scene (more specifically, 1998). Bit-for-bit Internet costs can be as much as 90 percent less than MPLS within the same region.

Bits & Bytes: How SD-WANs Replace Routers

Of course, to make all of that work we need our SD-WANs to perform the same reachability function as our routers and to fit into our networks. Unless you’re about to forklift upgrade every router in your network and your ecosystem, you will need both qualities – reachability and backward compatibility.

Reachability is pretty straightforward within the SD-WAN. SD-WAN appliances are installed at each location and appear as the default gateway for each site. The devices form a virtual fabric of IPsec VPN tunnels between them and an SD-WAN controller.

Deploying a new appliance into the SD-WAN is easy. SD-WAN appliances automatically reach back to the controller upon connecting to the network. Once authenticated and approved, the device downloads pre-defined configuration and traffic policies, configures itself, and sends locally attached prefixes back to the SD-WAN controller. The controller distributes the prefixes across the SD-WAN to the other nodes along with live link performance information and addresses for all the other SD-WAN nodes. The appliances use the traffic policies, which define minimum and/or maximum latency, loss and bandwidth tolerances per application group, to select the optimum path across the SD-WAN fabric for each application.
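The onboarding exchange can be sketched in a few lines. This is a toy model under stated assumptions: the Controller class, the serial-number approval list, and the addresses (drawn from documentation ranges) are all invented for illustration, and it leaves out authentication details and the live link performance data a real controller would also distribute.

```python
class Controller:
    """Toy SD-WAN controller: approves appliances and shares prefixes."""

    def __init__(self, approved_serials, policies):
        self.approved = set(approved_serials)  # appliances allowed to join
        self.policies = policies               # pre-defined traffic policies
        self.nodes = {}                        # serial -> address and prefixes

    def register(self, serial, address, prefixes):
        """Called by an appliance when it first reaches the controller."""
        if serial not in self.approved:
            raise PermissionError(f"appliance {serial} is not approved")
        self.nodes[serial] = {"address": address, "prefixes": list(prefixes)}
        # The appliance receives its policies plus everything it needs
        # to build tunnels to the other nodes.
        return {"policies": self.policies, "peers": self.peers_for(serial)}

    def peers_for(self, serial):
        """Addresses and prefixes of every node except the one asking."""
        return {s: info for s, info in self.nodes.items() if s != serial}


controller = Controller(
    approved_serials={"SN-001", "SN-002"},
    policies={"voice": {"max_latency_ms": 150}},
)
first = controller.register("SN-001", "192.0.2.1", ["10.1.0.0/16"])
second = controller.register("SN-002", "192.0.2.2", ["10.2.0.0/16"])
print(second["peers"])
```

The first appliance sees no peers when it joins; in this sketch it would learn about the second on its next call to peers_for, where a real controller would push the update.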

The nodes use the information to form a mesh of VPN tunnels (the overlay) with other appliances and share link information. These tunnels can cross an MPLS network, an Internet service, or private links. With that data, the appliance can calculate optimum paths for applications based on the latency, loss and bandwidth characteristics of the available paths and the customer’s traffic policies.

Incoming packets go through some form of deep packet inspection (DPI) where the SD-WAN appliance identifies the associated application. It looks up the path across the VPN that meets the requirements of the application policies and forwards the packets accordingly.

SD-WAN Path Control

 

While routing protocols may not be needed among the locations participating in the SD-WAN, routing protocols are necessary to reach locations beyond the SD-WAN. Being able to run OSPF or BGP at the edge allows the SD-WAN to connect into the existing routed network gracefully. There’s no need for the impossible – forklift upgrading every possible router. The edge SD-WAN device running OSPF or BGP becomes the destination for prefixes accessed within the SD-WAN.

What About Internet Performance?

The big objections I hear from seasoned network engineers center around performance and uptime. MPLS is a private data service with a level of predictability and uptime that can’t be matched by the Internet (err maybe not…more on that in a later post). Most MPLS services will include SLAs with 99.99 percent uptime and promise near 0% packet loss. Internet services are lucky if they hit 99.5 percent with high single-digit packet loss rates, or so I’ll hear folks argue.

Those are very solid arguments, ones I made when I first started looking into SD-WANs. Having launched and run MPLS-Experts, I too was skeptical about the possibility of using the Internet as a backbone.

The first thing that I discovered was that so many of my assumptions about Internet performance had become outdated. The deployment of fiber in our backbones and the spread of cables between continents have not only increased Internet bandwidth across the globe but also helped reduce bit error rates. As a result, Internet performance has become far more reliable in most parts of the world, except China, of course.

Here’s one indicator. The PingER project out of Stanford has tracked Internet performance since the late ‘90s. A look at their loss metrics shows that, other than a hiccup in 2005, median loss rates across the globe have steadily declined since 1999, improving by 88 percent. Within the US, median packet loss rates declined 95 percent since 1998 to 0.028 percent.

Packet Loss Trends

I still have my concerns about fluctuations in Internet performance. Because Internet services are unmanaged, latency can suddenly jump in ways that seem to defy logic. SD-WANs address this problem, though, by letting you connect to and monitor the performance characteristics of multiple circuits (Internet and MPLS).

When performance levels drop below the defined thresholds in application policies, SD-WAN nodes move traffic over to an alternate path. Depending on the SD-WAN implementation, switch-over can be sub-second, preserving the session. Hard to believe, I know, but I have seen it work, and it is darn impressive.
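The switch-over decision itself is simple to sketch. The class below is a hypothetical illustration, not how any particular vendor does it: it stays on the active path while that path is within thresholds and moves only when it is not. It models the decision alone; the sub-second part depends on how often a real appliance probes each path.

```python
class PathSelector:
    """Keeps traffic on the active path until it violates the thresholds."""

    def __init__(self, paths, max_latency_ms, max_loss_pct):
        self.paths = list(paths)            # ordered by preference
        self.max_latency_ms = max_latency_ms
        self.max_loss_pct = max_loss_pct
        self.active = self.paths[0]

    def healthy(self, metrics):
        """A path with no measurement is treated as unhealthy."""
        if metrics is None:
            return False
        return (metrics["latency_ms"] <= self.max_latency_ms
                and metrics["loss_pct"] <= self.max_loss_pct)

    def update(self, measurements):
        """Feed in the latest {path: metrics}; returns the path to use."""
        if self.healthy(measurements.get(self.active)):
            return self.active  # no change, so no flapping between paths
        for name in self.paths:
            if name != self.active and self.healthy(measurements.get(name)):
                self.active = name
                break
        return self.active  # unchanged if no alternate is healthy


selector = PathSelector(["internet", "mpls"], max_latency_ms=150, max_loss_pct=1.0)
good = {"latency_ms": 40, "loss_pct": 0.1}
spike = {"latency_ms": 400, "loss_pct": 0.1}

print(selector.update({"internet": good, "mpls": good}))   # normal operation
print(selector.update({"internet": spike, "mpls": good}))  # latency jumps
```

One design choice worth noting: the selector does not move back as soon as the original path recovers. Whether to revert, and how long to wait before doing so, is a policy decision that differs between implementations.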

Availability: Brownouts, Blackouts, Nearly Eliminated

Connecting to multiple networks also plays a significant role in SD-WANs’ ability to overcome availability problems. To understand this better, let’s break out the issues facing today’s WANs.

From my experience, the easiest uptime problems to detect and address are hard failures like a backhoe slicing a cable. The indications of a blackout are pretty easy to identify and show the service provider. More often than not, the better service providers I work with will identify blackouts proactively, and the excellent ones will tell their customers before the customers notice the failure.

Short of a natural disaster (head nod to Hurricane Sandy), hard failures typically occur in the local loop. The core and distribution networks of most underlying Internet and MPLS services contain sufficient redundancy to route around failures.

The high cost of MPLS services often makes redundant local loop connections impractical, and they are not always useful. Customers would need to dual-home a location for full redundancy. Dual-homing means every component of the physical plant, from the cable plant to the conduits to the DSLAMs and more, is duplicated. Providers that offer dual-homed services can rarely duplicate every component. Most buildings don’t have diverse points of entry for telco services.

The more difficult challenges are the brownouts — those intermittent problems whose only symptom is a slowdown in a service or application. Brownouts are the kinds of hiccups that often go undiagnosed. They flare up and then disappear. They can be caused by carrier backbone issues or physical plant issues – a bad connection or poor wiring – on a route, but identifying them can be difficult and getting the service provider’s attention, often impossible.

With both brownouts and blackouts, rapid response in today’s MPLS/IP network is a problem. OSPF or BGP convergence takes too long, or simply doesn’t work. Where companies are relying on routing protocols to switch over to a backup connection, convergence time is not nearly fast enough (sub-second) to preserve session state, so application sessions drop. With many applications, dropping a session is an annoyance at best, barely noticed by users. However, with voice, video conferencing and other real-time protocols, dropping sessions drives users mad. Do you want to be the one to tell the CEO why his call dropped?

SD-WANs address availability and consistency issues by aggregating multiple connections. Combining two completely separate services, both with 99.0% uptime, will yield a theoretical site availability of 99.99 percent. Typically, as we noted, most offices cannot purchase two redundant local-loop MPLS paths, as runs from even separate providers will share some physical components. However, with SD-WAN, companies can leverage different access technologies, such as LTE and DSL, which all but guarantees full access redundancy.
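The arithmetic behind that 99.99 percent figure is worth spelling out. It assumes the two services fail independently, which is the whole point of using different access technologies: the site is down only when both links are down at the same time. The helper names and the 8,760-hour year below are mine.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def combined_availability(*availabilities):
    """Availability of a site with several independent links.

    The site is down only when every link is down at once, so the
    individual downtime probabilities multiply.
    """
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1.0 - availability
    return 1.0 - all_down


def downtime_hours_per_year(availability):
    """Expected hours of outage per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.99
dual = combined_availability(0.99, 0.99)
print(f"one link:  {single:.4%}, about {downtime_hours_per_year(single):.1f} hours down per year")
print(f"two links: {dual:.4%}, about {downtime_hours_per_year(dual):.2f} hours down per year")
```

In other words, a single 99.0% link is down roughly 87.6 hours a year, while two independent ones together are down under an hour. If the links share a conduit or a building entry, the independence assumption breaks and the real number will be worse.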

Moreover, since SD-WAN nodes continuously monitor the underlying networks, they can respond quickly to brownouts or blackouts.

Like I said, the sub-second failover supported by some SD-WAN vendors is rapid enough to maintain a VoIP call and far faster than OSPF or BGP convergence.

Doesn’t This All Make Sense?

We can make routing simpler; we can make it faster. As Hurricane Sandy (among other weather events or earthquakes) showed us, natural disasters can happen anywhere. When they hit, the Internet will always be your most predictable option. Add in the dropping Internet costs, and it seems pretty obvious to me that SD-WANs make a very compelling case to be the next replacement for routers. What do you think?