Facebook Data Centre Engineer Develops Troubleshooting Tool

A Facebook data centre engineer has built a monitoring tool called Claspin that uses heat maps to troubleshoot potential problems in data centres.

The engineer, Sean Lynch, described in a blog post how he and his fellow engineers need to ensure the health of the social networking giant’s cache systems by quickly identifying and fixing any potential problems with servers, racks or clusters.

Facebook Data Centre

Facebook has two major cache systems, said Lynch. “Memcache is a simple lookaside cache with most of its smarts in the client, and TAO, a caching graph database that does its own queries to MySQL.”

“Between these two systems, we have literally thousands of charts, some of which are collected into dashboards showing various latency, request rate, and error rate statistics collected by clients and servers,” wrote Lynch.

Lynch described how these dashboards worked well at first, but as Facebook grew and its systems became increasingly complex, it became more and more difficult to figure out which piece was broken when something went wrong. He then started to think about a tool or system that would provide quick visual insights into the status of cache, “analogous to meters and traffic lights.”

Lynch explained that the name Claspin came from a suggestion by a friend with a background in organic chemistry. Claspin is a protein that monitors for DNA damage in a cell.

Heatmap Tool

Lynch’s first attempt resulted in a command line tool that printed a lot of text. But Lynch wanted something that would represent potential problems visually, and settled on the idea of heatmaps.

“I’d been fond of heatmaps for quite a while, but it wasn’t entirely clear to me how to lay out this data in two dimensions in a way that would be meaningful to the user. It seemed somewhat obvious that we wanted each “pixel” of the heatmap to represent a host, with racks grouped together,” wrote Lynch. “However, our racks don’t necessarily have the same number of hosts in them, and it wasn’t clear how to colour individual hosts when we have about a dozen metrics for each. Eventually I realised that all we cared about was whether anything was wrong with a host. So I settled on colouring a host by its “hottest” statistic, with hotness computed from predefined thresholds. It’s dirt simple, but it gives us a way to encode tribal knowledge about what values are “bad” into the view.”

Lynch said that hosts that are missing a stat are coloured black, indicating that the host is probably down.
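
The colouring rule Lynch describes can be sketched in a few lines of Python. This is only an illustration, not Claspin's actual code: the stat names, the threshold values, the linear scaling between thresholds and the green-to-red colour ramp are all assumptions made for the example. What it takes from the blog post is the idea that a host is coloured by its hottest stat, and that a host missing a stat is drawn black.

```python
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical thresholds: stat name -> (warning level, critical level).
# The names and numbers are invented for illustration only.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "latency_ms": (5.0, 20.0),
    "error_rate": (0.01, 0.05),
}


def stat_hotness(value: float, warn: float, crit: float) -> float:
    """Scale a value to 0.0..1.0: 0 at or below warn, 1 at or above crit."""
    if value <= warn:
        return 0.0
    if value >= crit:
        return 1.0
    return (value - warn) / (crit - warn)


def host_hotness(
    stats: Mapping[str, Optional[float]],
    thresholds: Mapping[str, Tuple[float, float]] = THRESHOLDS,
) -> Optional[float]:
    """Return the hotness of the hottest stat, or None if a stat is missing."""
    hottest = 0.0
    for name, (warn, crit) in thresholds.items():
        value = stats.get(name)
        if value is None:
            return None  # a missing stat suggests the host is probably down
        hottest = max(hottest, stat_hotness(value, warn, crit))
    return hottest


def host_colour(
    stats: Mapping[str, Optional[float]],
    thresholds: Mapping[str, Tuple[float, float]] = THRESHOLDS,
) -> Tuple[int, int, int]:
    """Map a host to an RGB colour: black if a stat is missing, else green to red."""
    hotness = host_hotness(stats, thresholds)
    if hotness is None:
        return (0, 0, 0)
    return (round(255 * hotness), round(255 * (1 - hotness)), 0)
```

Because only the maximum matters, adding another stat to the thresholds table never makes a host look healthier; it can only surface a problem that the other stats missed.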

He eventually settled on a separate heatmap per cluster, ordered by rack number and with each rack drawn vertically in an alternating “snake” pattern so racks would stay contiguous even if they wrapped around the top or bottom. “The rack names naturally sort by data centre, then cluster, then row, so problems common at any of these levels are readily apparent,” he wrote.
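
The “snake” ordering can be illustrated with a short Python sketch. It is an assumption about how such a layout might be computed, not a description of Facebook's implementation: the function name, the input shape (a mapping from rack name to a list of host names) and the fixed column height are all hypothetical. Hosts are placed in rack order, running down one column and then up the next, so a rack that spills over the edge of a column continues in the cell right beside it.

```python
from typing import Dict, Mapping, Sequence, Tuple


def snake_layout(
    racks: Mapping[str, Sequence[str]], height: int
) -> Dict[str, Tuple[int, int]]:
    """Assign each host a (column, row) cell in a vertical snake pattern.

    Racks are visited in sorted name order. Even columns fill top to
    bottom and odd columns fill bottom to top, so consecutive hosts are
    always in adjacent cells, even across a column boundary.
    """
    if height <= 0:
        raise ValueError("height must be positive")
    positions: Dict[str, Tuple[int, int]] = {}
    index = 0
    for rack in sorted(racks):
        for host in racks[rack]:
            column, offset = divmod(index, height)
            row = offset if column % 2 == 0 else height - 1 - offset
            positions[host] = (column, row)
            index += 1
    return positions


# Example with two hypothetical racks on a grid two cells high.
layout = snake_layout({"rack1": ["a", "b", "c"], "rack2": ["d", "e"]}, height=2)
```

In this example host "b" ends the first column at the bottom and host "c" starts the second column at the bottom, so rack1 stays contiguous even though it wraps.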

“Claspin allows us to visualise a ridiculous amount of information at once, in a way that makes it easy to spot problems and patterns,” Lynch said. “On a 30″ screen we could easily fit 10,000 hosts at the same time, with 30 or more stats contributing to their colour, updated in real time – usually in a matter of seconds or minutes.”

Open Source?

Although Claspin is geared towards Facebook’s own internal hardware configurations, the social network giant is reportedly considering offering the tool as open source. In the past, for example, it has open sourced other internal tools.

Facebook also launched its Open Compute Project in April 2011, after it built a highly efficient data centre in Prineville, Oregon. The project first revealed its original Open Rack specification (version 0.5) back in December 2011, and earlier this week it released the second version, dubbed the v1.0 Open Rack specification.


Tom Jowitt

Tom Jowitt is a leading British tech freelancer and long standing contributor to Silicon UK. He is also a bit of a Lord of the Rings nut...
