On the evening of Monday, October 4, Facebook suffered one of the biggest outages in its history, making Facebook, the Messenger platform, Instagram, and WhatsApp inaccessible to more than a billion users.
Facebook has concluded that the outage began in the system that manages its global backbone network*. The backbone network connects all of its data centers and carries large amounts of incoming and outgoing data across the globe.
Facebook writes about the outage: "in the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance — perhaps repairing a fiber line, adding more capacity, or updating the software on the router itself. (…)”
During one such routine operation, a command was issued by mistake that shut down the entire backbone network. Audit software is normally in place to block such commands, but in this case it failed as well.
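To illustrate what a safeguard of this kind might check, here is a minimal Python sketch. It is not Facebook's audit tool, whose internals are not described here. The function name `audit_command`, the link names, and the 50 percent threshold are all hypothetical and chosen only for illustration.

```python
def audit_command(links_to_disable, all_links, min_remaining_fraction=0.5):
    """Return True if the command may run, False if it should be blocked.

    Hypothetical rule: block any command that would leave fewer than
    `min_remaining_fraction` of the backbone links online.
    """
    all_links = set(all_links)
    if not all_links:
        raise ValueError("the backbone must have at least one link")
    # Links that would still be online after the command runs.
    remaining = all_links - set(links_to_disable)
    return len(remaining) / len(all_links) >= min_remaining_fraction


if __name__ == "__main__":
    backbone = {"link-a", "link-b", "link-c", "link-d"}  # made-up link names
    print(audit_command({"link-a"}, backbone))  # routine maintenance on one link
    print(audit_command(backbone, backbone))    # would take every link offline
```

A check like this only helps if it is itself correct. If the safeguard fails, as the audit software did in this case, nothing stands between the mistaken command and the network.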
The backbone shutdown led to a second problem: Facebook's own DNS servers became unreachable, so the rest of the Internet could no longer find Facebook's services.
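The effect of unreachable DNS servers can be sketched with a small simulation. The Python example below is a simplified model, not real DNS code and not Facebook's setup. The names `NameServer`, `resolve`, and `ResolutionError` are hypothetical, as are the hostname and the addresses, which come from ranges reserved for documentation.

```python
from dataclasses import dataclass, field


class ResolutionError(Exception):
    """Raised when no reachable name server can answer for a hostname."""


@dataclass
class NameServer:
    address: str
    reachable: bool
    records: dict = field(default_factory=dict)  # hostname -> IP address


def resolve(hostname, name_servers):
    """Ask each name server in turn, skipping the ones that cannot be reached."""
    for server in name_servers:
        if not server.reachable:
            continue  # the query never arrives, so try the next server
        if hostname in server.records:
            return server.records[hostname]
    raise ResolutionError(f"no reachable name server could answer for {hostname}")


if __name__ == "__main__":
    servers = [
        NameServer("192.0.2.1", True, {"www.example.com": "198.51.100.7"}),
        NameServer("192.0.2.2", True, {"www.example.com": "198.51.100.7"}),
    ]
    print(resolve("www.example.com", servers))

    # Simulate the outage: every name server becomes unreachable.
    for server in servers:
        server.reachable = False
    try:
        resolve("www.example.com", servers)
    except ResolutionError as error:
        print(error)
```

The records themselves never change in this model. The lookup fails only because nobody can reach the servers that hold them, which is why a service can disappear from the Internet even if its name records are intact.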
The outage meant that online access to Facebook's data centers was impossible, and technicians had to access them physically. Getting inside took time because of strict security measures. Facebook adds: "(…) Once inside the data center, hardware and routers are designed to be difficult to work with - even when you have physical access to them."
It therefore took extra long before Facebook could ascertain what had happened, solve the problems, and get its backbone network up and running again.
Read Facebook's version of events here.
*The backbone network is part of the infrastructure that connects all of Facebook's data facilities across the globe.
* Facebook's data centers come in different forms. Some are massive buildings that house millions of machines that store data and run the heavy computational loads that keep its platforms running, and others are smaller facilities that connect its backbone network to the broader internet and the people using its platforms.