Facebook Crashes Twice In A Comedy Of Errors
Facebook has apologised for an outage yesterday that cut off “many” of its users, caused by mishandling an error code
Facebook went down for two and a half hours on Thursday because an error condition in the social network’s system was mishandled.
Web performance management company AlertSite recorded the site’s availability dropping to 38.46 per cent yesterday evening. Robert Johnson, director of software engineering at Facebook, wrote an apology to the affected users and detailed the problem.
Errors Flagging Errors
Basically, a routine used to handle invalid data found during error-checking was itself flagged as being in error. This caused the system to try to replace it, but the only replacement code available was identical to the flagged routine. The error-checker, unsurprisingly, found that to be in error too, and so an infinite loop began. On top of that, the checker was still receiving routine calls from the rest of the system, which ground the whole system to a halt. A classic case of a developer not thinking outside the box, with a comedy of errors resulting from it.
From the users’ viewpoint, their only friend on Facebook was a message saying that there was a “DNS error”. For Facebook’s IT team, it meant a few red faces in their new green data centre.
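To make the feedback loop concrete, here is a minimal Python simulation. It is only a sketch under stated assumptions: the names `is_valid`, `repair_loop` and `simulate`, the "negative means invalid" rule and all the numbers are hypothetical and are not taken from Facebook's code. It shows how a checker that repairs a flagged item by refetching it behaves very differently when the replacement it fetches is flagged too.

```python
# Hypothetical sketch: every name and number here is illustrative only.

def is_valid(value):
    """Stand-in check: any negative value counts as being in error."""
    return value >= 0

def repair_loop(cache, store, key, max_queries):
    """Naive repair: while the cached item is flagged, refetch it from the store.

    Returns how many store queries were made. The max_queries cap exists
    only so that this simulation terminates; the failure described in the
    article had no such limit.
    """
    queries = 0
    while not is_valid(cache[key]) and queries < max_queries:
        cache[key] = store[key]  # the replacement comes from the store
        queries += 1
    return queries

def simulate(store_value, clients=3, max_queries=5):
    """Total store queries made by several clients that all start with a bad item."""
    store = {"setting": store_value}
    total = 0
    for _ in range(clients):
        cache = {"setting": -1}  # each client begins with a flagged item
        total += repair_loop(cache, store, "setting", max_queries)
    return total

healthy = simulate(store_value=10)  # replacement is good: one query per client
broken = simulate(store_value=-1)   # replacement is also flagged: every client hits the cap
print(healthy, broken)  # prints "3 15"
```

With a good replacement each client queries once and stops. With a flagged replacement every client keeps retrying until the artificial cap, so the load grows with the number of clients. A cap or back-off of this kind is one common way to keep such a loop from overwhelming the system behind it.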
“The way to stop the feedback cycle was quite painful,” Johnson wrote. “We had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”
Facebook engineers have yet to provide a fix for the condition. In the meantime, the reconfiguration module has been switched out. Presumably, Facebook executives have crossed their fingers that this will not adversely affect the system again.
Johnson’s missive ends: “We apologise again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.”
It is the worst outage that Facebook has had in the past four years, but it is also the second in two days. The earlier problem was a lot shorter, affected fewer people and was put down to issues at a third-party networking provider.