Emergency services aren’t supposed to fail, ever. That’s been the promise of 911 since its inception. And yet that’s exactly what happened last month when the CenturyLink outage impacted 15 data centers in the US, taking down 911, cloud, and DSL services.
The outage underscores the national importance of our Internet backbone and raises the question of whether 911 services were dual-homed to multiple backbones. For enterprises looking to rely on the Internet as the basis for the new WAN, core redundancy might be the most important lesson.
CenturyLink: What went wrong?
We’ve seen other Internet outages in the past, true. Around August, for example, Interoute (now GTT) saw its network go down. In June, a cable cut took out the Comcast network. And late last year, a BGP router on the Level3 network leaked routes from a misconfigured AS.
But the CenturyLink outage was unique in its scope and length. The cause of the outage? A bad network card, it would seem. On December 27 at 08:40 (GMT), CenturyLink says it identified “initial service impact” in New Orleans, LA. The NOC was engaged to investigate the cause and Field Operations were dispatched for assistance onsite. Tier IV equipment vendor support was engaged when they determined that the issue was larger than a single site.
During the troubleshooting process, a decision was made to isolate a device in San Antonio, Texas from the network as it seemed to be broadcasting traffic and consuming capacity. This action did alleviate the impact; however, investigations remained ongoing. Focus shifted to additional sites when networking teams were unable to remotely troubleshoot equipment.
Field Operations were dispatched to sites in Kansas City, MO, Atlanta, GA, New Orleans, LA, and Chicago, IL. As visibility into the equipment was regained, Tier IV equipment vendor support evaluated the logs to further assist with isolation. Additionally, a polling filter was applied to the equipment in Kansas City, MO and New Orleans, LA to prevent any additional effects.
So significant was the outage that Federal Communications Commission (FCC) Chairman Ajit Pai has launched an investigation into it.
“When an emergency strikes, it’s critical that Americans are able to use 911 to reach those who can help. The CenturyLink service outage is therefore completely unacceptable, and its breadth and duration are particularly troubling. I’ve directed the Public Safety and Homeland Security Bureau to immediately launch an investigation into the cause and impact of this outage. This inquiry will include an examination of the effect that CenturyLink’s outage appears to have had on other providers’ 911 services…”
Redundant WAN Core Design
To prevent being impacted by outages like the one that hit CenturyLink, organizations need to be sure to build redundancy into their SD-WAN core. Having complete redundancy in the last- and middle-mile is the only way to ensure that a single event won’t take down your network, as it did to 911.
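One rough way to reason about this is to audit which sites depend on a single backbone provider. The sketch below is illustrative only: the site names, carrier names, and the `single_provider_failures` helper are hypothetical, and it assumes you can list the upstream providers behind each site's links.

```python
from typing import Dict, List, Set


def single_provider_failures(site_links: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each provider to the sites that would go dark if that provider alone failed."""
    # Collect every provider that appears on any site's links.
    providers = {p for links in site_links.values() for p in links}
    impact: Dict[str, Set[str]] = {}
    for provider in providers:
        # A site goes dark if none of its links use a different provider.
        dark = {
            site
            for site, links in site_links.items()
            if not any(p != provider for p in links)
        }
        if dark:
            impact[provider] = dark
    return impact


# Hypothetical inventory: provider behind each link at each site.
site_links = {
    "dallas": ["carrier_a", "carrier_b"],  # dual-homed to two backbones
    "denver": ["carrier_a"],               # single link
    "boston": ["carrier_a", "carrier_a"],  # two links, but the same backbone
}

at_risk = single_provider_failures(site_links)
for provider, sites in at_risk.items():
    print(f"{provider} failing alone would isolate: {sorted(sites)}")
```

Note that the "boston" entry is flagged even though it has two links: two circuits on the same backbone do not protect against a core outage of the kind described above.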