How a Tiny Error Shut Off the Internet for Parts of the US and Canada
A year ago, a DDoS attack caused internet outages around the US by targeting the internet-infrastructure company Dyn, which provides Domain Name System services to look up web servers. Monday saw a nationwide series of outages as well, but with a more pedestrian cause: a misconfiguration at Level 3, an internet backbone company—and enterprise ISP—that underpins other big networks. Network analysts say that the misconfiguration was a routing issue that created a ripple effect, causing problems for companies like Comcast, Spectrum, Verizon, Cox, and RCN across the country.
Level 3, whose acquisition by CenturyLink closed recently, said in a statement to WIRED that it resolved the issue in about 90 minutes. “Our network experienced a service disruption affecting some customers with IP-based services,” the company said. “The disruption was caused by a configuration error.” Comcast users started reporting internet outages around the time of the Level 3 outages on Monday, but the company said that it was monitoring “an external network issue” and not a problem with its own infrastructure. RCN confirmed that it had some network problems on Monday because of Level 3. The company said it had restored RCN service by rerouting traffic to a different backbone.
The misconfiguration was a “route leak,” according to Roland Dobbins, a principal engineer at the DDoS and network-security firm Arbor Networks, which monitors global internet operations. ISPs use “Autonomous Systems,” also known as ASes, to keep track of what IP addresses are on which networks, and route packets of data between them. They use the Border Gateway Protocol (BGP) to establish and communicate routes. For example, packets can route between networks A and B, but network A can also route packets to network C through network B, and so on. This is how internet service providers interoperate to let you browse the whole internet, not just the IP addresses on their own networks.
In a “route leak,” an AS, or multiple ASes, issue incorrect information about the IP addresses on their network, which causes inefficient routing and failures for both the originating ISP and other ISPs trying to route traffic through. Think of it like a series of street signs that help keep traffic flowing in the right directions. If some of them are mislabeled or point the wrong way, assorted chaos can ensue.
Route leaks can be malicious, sometimes called “route hijacks” or “BGP hijacks,” but Monday’s incident seems to have been caused by a simple mistake that ballooned to have national impact. Large outages caused by accidental route leaks have cropped up before.
“Folks are looking to tweak routing policies, and make mistakes,” Arbor Networks’ Dobbins says. The problem could have come as CenturyLink works to integrate the Level 3 network or could have stemmed from typical traffic engineering and efficiency work.
Internet outages of all sizes caused by route leaks have occurred occasionally, but consistently, for decades. ISPs attempt to minimize them using “route filters” that check the IP routes their peers and customers intend to use to send and receive packets and attempt to catch any problematic plans. But these filters are difficult to maintain on the scale of the modern internet and can have their own mistakes.
Monday’s outages reinforce how precarious connectivity really is, and how certain aspects of the internet’s architecture—offering flexibility and ease-of-use—can introduce instability into what has become a vital service.