Facebook Crashes Twice In A Comedy Of Errors

Facebook went down on Thursday for two and a half hours because the social network’s systems mishandled an error condition.

Web performance management company AlertSite logged that the site availability dropped to 38.46 per cent yesterday evening. Robert Johnson, director of software engineering at Facebook, wrote an apology to the affected users and detailed the problem.

Errors Flagging Errors

Basically, a routine used to handle invalid data found during error-checking was itself flagged as being in error. This caused the system to try to replace it, but the only replacement code available was identical to the flagged routine. On top of that, the checker was still receiving calls from the rest of the system, which ground the whole site to a halt.

From the users’ viewpoint, their only friend on Facebook was a message saying that there was a “DNS error”. For Facebook’s IT team, it meant a few red faces in their new green data centre.

The error-checker, unsurprisingly, found the replacement to be in error too, and so an infinite loop began. It was a classic case of a developer not thinking outside the box, and a comedy of errors resulted.
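The loop is easy to picture in code. What follows is only a rough sketch in Python, not Facebook's code: the names `is_valid`, `Store` and `repair`, the validity rule and the retry cap are all assumptions made for illustration. It shows a checker fetching a replacement that is just as invalid as the original, and how a simple cap on attempts would stop the cycle rather than let it run forever.

```python
from dataclasses import dataclass
from typing import Tuple


def is_valid(value: str) -> bool:
    """Hypothetical validity rule: anything other than 'bad' passes."""
    return value != "bad"


@dataclass
class Store:
    """Stand-in for whatever supplies the replacement (an assumption)."""
    value: str
    queries: int = 0

    def fetch(self) -> str:
        # Every repair attempt costs one query against the store.
        self.queries += 1
        return self.value


def repair(cached: str, store: Store, max_attempts: int) -> Tuple[str, bool]:
    """Replace an invalid value, giving up after max_attempts fetches.

    Returns the final value and whether it passed validation.
    """
    attempts = 0
    while not is_valid(cached):
        if attempts >= max_attempts:
            # Without this cap the loop would never end when the
            # replacement is as invalid as the original.
            return cached, False
        cached = store.fetch()
        attempts += 1
    return cached, True


store = Store(value="bad")  # the replacement is no better than the original
value, ok = repair("bad", store, max_attempts=5)
print(value, ok, store.queries)  # prints: bad False 5
```

Multiply those wasted queries by every caller in a large system and it becomes clear how a well-meaning repair routine can swamp the thing it depends on.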

“The way to stop the feedback cycle was quite painful,” Johnson wrote. “We had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”

Facebook engineers have yet to provide a permanent fix for the condition. In the meantime, the reconfiguration module has been switched off. Presumably, Facebook executives have crossed their fingers that this will not adversely affect the system again.

Johnson’s missive ends: “We apologise again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.”

It is the worst outage that Facebook has had in the past four years, but it is also the second in two days. The previous day’s problem was a lot shorter, affected fewer people and was put down to issues at a third-party networking provider.


Eric Doyle, ChannelBiz

Eric is a veteran British tech journalist, currently editing ChannelBiz for NetMediaEurope, with expertise in security, the channel, and Britain's startup culture through his TechBritannia initiative.
