Facebook Outage. What went wrong?

A blast from the past

March 13th, 2019 turned out to be a nightmare for Facebook and its affiliates. The social network suffered a major outage that also hit its related products Instagram, Messenger, and WhatsApp and lasted at least 14 hours. The incident affected millions of users across the globe and cost approximately $90 million, while the company’s share price dropped 2.3%. Two and a half years later, Facebook disappeared from the Internet again, disrupting individual users as well as businesses that rely on Facebook’s network instead of their own websites.

What happened on October 4th?

On Monday, October 4th, 2021, Facebook suffered one of its worst outages. It lasted about six hours and also affected Instagram and WhatsApp. The disruption was caused by a change in Facebook’s network configuration, which shows how delicate the Internet can be. To external users, it looked as if Facebook, Instagram, and WhatsApp had disappeared from the Internet.

In an update on the outage, Facebook said that it was caused by “configuration changes on the backbone routers that coordinate network traffic between our data centers,” which blocked their ability to communicate and set off a cascade of network failures. That explanation points to a problem in how Facebook’s network interacted with the Border Gateway Protocol (BGP), a vital mechanism underlying the Internet.

DNS, BGP, and a global blackout

The Internet is a network of networks, and it is bound together by the Border Gateway Protocol (BGP). BGP allows one network (say, Facebook) to advertise its presence to the other networks that form the Internet. It is often compared to a postal ZIP code system because it tells the rest of the world where to route traffic. Without BGP, Internet routers wouldn’t know where to send traffic, and the Internet wouldn’t work.
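To make the idea of advertising routes more concrete, here is a minimal sketch of how a router might keep track of announced prefixes and pick a destination network. It is a toy model, not real BGP: the class RoutingTable, the network names, and the documentation-only address ranges are all assumptions made for illustration.

```python
import ipaddress


class RoutingTable:
    """Toy view of the prefixes a router has learned (not a BGP implementation)."""

    def __init__(self):
        # Maps an announced prefix to the (hypothetical) network that announced it.
        self.routes = {}

    def announce(self, prefix, origin):
        """Record that `origin` advertises reachability for `prefix`."""
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        """Forget a previously announced prefix."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the router does not know where to send traffic
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "ExampleNet")
table.announce("192.0.2.128/25", "ExampleNet-Edge")

print(table.lookup("192.0.2.10"))   # matched by the /24 only
print(table.lookup("192.0.2.200"))  # the more specific /25 wins

table.withdraw("192.0.2.0/24")
table.withdraw("192.0.2.128/25")
print(table.lookup("192.0.2.10"))   # no route left
```

Once both prefixes are withdrawn, the lookup returns nothing at all. From the outside, that is what a network that has stopped announcing its routes looks like.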

The Domain Name System (DNS) is a sort of phonebook for the Internet. Put simply, it lets users connect to websites using domain names instead of IP addresses.
At some point, Facebook stopped announcing the routes to its DNS prefixes. The company attributed this to “the extensive day-to-day work of maintaining our infrastructure. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally”.

Because Facebook stopped announcing its DNS prefix routes through BGP, users’ DNS resolvers could no longer reach Facebook’s nameservers and started returning SERVFAIL responses.
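The chain from withdrawn routes to SERVFAIL can be sketched as a small simulation. This is a simplified model under stated assumptions, not how a production resolver is implemented: the domain, the addresses, and the function resolve are hypothetical, and only the response codes NOERROR (0), SERVFAIL (2), and NXDOMAIN (3) are taken from the DNS specification.

```python
from enum import Enum


class RCode(Enum):
    """A subset of DNS response codes."""
    NOERROR = 0
    SERVFAIL = 2
    NXDOMAIN = 3


# Hypothetical zone data: authoritative nameserver IPs and the records they serve.
NAMESERVERS = {"example.com": ["192.0.2.53", "192.0.2.54"]}
RECORDS = {"www.example.com": "198.51.100.7"}


def resolve(name, zone, reachable_nameservers):
    """Simulate a recursive resolver asking the zone's nameservers for `name`.

    `reachable_nameservers` is the set of nameserver IPs the resolver
    currently has a route to.
    """
    for ns_ip in NAMESERVERS.get(zone, []):
        if ns_ip not in reachable_nameservers:
            continue  # the query would time out, so try the next nameserver
        address = RECORDS.get(name)
        if address is None:
            return RCode.NXDOMAIN, None
        return RCode.NOERROR, address
    # No authoritative nameserver answered: the resolver gives up.
    return RCode.SERVFAIL, None


# Routes announced: at least one nameserver can be reached.
print(resolve("www.example.com", "example.com", {"192.0.2.53"}))

# Routes withdrawn: no nameserver can be reached.
print(resolve("www.example.com", "example.com", set()))
```

Note that the records themselves never change in this sketch. The name still exists, but no one can reach the servers that would confirm it.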

Why did it take so long to get the system back on track?

As Facebook explains, two major obstacles kept its engineers from fixing the issue quickly. First, it was not possible to access Facebook’s data centers through the usual means because their networks were down. Second, the total loss of DNS broke many of the internal tools the company normally uses to investigate and resolve outages.
Once backbone network connectivity was restored across Facebook’s data centers, everything came back up with it.
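The two obstacles described above amount to a dependency chain, which can be sketched as a small graph walk. The component names and the DEPENDS_ON map below are a simplified assumption based on Facebook's description, not an inventory of its actual systems.

```python
# Hypothetical dependency map: each component lists what it needs in order to work.
DEPENDS_ON = {
    "backbone": [],
    "dns": ["backbone"],
    "remote_access": ["backbone"],
    "internal_tools": ["dns"],
}


def affected_by(failed):
    """Return every component that goes down when `failed` goes down."""
    down = {failed}
    changed = True
    while changed:
        changed = False
        for component, deps in DEPENDS_ON.items():
            if component not in down and any(dep in down for dep in deps):
                down.add(component)
                changed = True
    return down - {failed}


print(sorted(affected_by("backbone")))
```

In this model, losing the backbone takes out remote access directly and the internal tools indirectly through DNS, so the means of repair fail together with the thing that needs repairing.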

Was the outage caused by a hack?

Despite various conspiracy theories, Facebook’s representatives stressed that there was no malicious activity behind the outage and that “its root cause was a faulty configuration change on our end. We also have no evidence that user data was compromised as a result of this downtime”.