CloudFlare Gets Caught Out By 2016 Leap Second
The leap second caused CloudFlare’s RRDNS software to “panic,” but the error was quickly identified
The extra leap second added to the end of 2016 may not have had an effect on most people, but it did catch out a few web companies that failed to factor it in.
Web services and security firm CloudFlare was one such example. A small number of its servers went down at midnight UTC on New Year’s Day due to an error in RRDNS, the Domain Name System (DNS) proxy written to help scale CloudFlare’s DNS infrastructure. The failure limited web access for some of its customers.
As CloudFlare explained in a blog post, a number in the software went negative when it should have been zero at worst, causing RRDNS to “panic” and disrupting DNS resolution for some websites.
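To illustrate how such a number can dip below zero, here is a minimal sketch in Python. It is not CloudFlare’s code: the function name and the timestamps are hypothetical, and it simply assumes an interval is measured by subtracting two wall-clock readings, with the clock stepping backwards in between.

```python
def elapsed_seconds(start: float, end: float) -> float:
    """Naive interval: silently assumes `end` is never earlier than `start`."""
    return end - start


# Hypothetical wall-clock readings in Unix seconds. The second reading is
# taken later in real time, but the clock has stepped backwards in between.
reading_before = 1483228799.9
reading_after = 1483228799.2

interval = elapsed_seconds(reading_before, reading_after)
interval_is_negative = interval < 0  # True: the "elapsed" time is below zero
```

Any code downstream that assumes the interval is zero or positive is then working with a value it was never designed to handle.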
Time warp
The problem only affected “a small number” of CloudFlare customers using CNAME DNS records, with “approximately 0.2 percent” of DNS queries being affected and errors occurring in less than one percent of all HTTP requests to CloudFlare.
The issue was confirmed by the company’s engineers at 00:34 UTC on New Year’s Day and the fix – which involved patching the clock source to ensure it normalises if time ever skips backwards – was rolled out to the majority of the affected data centres by 02:50 UTC.
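The idea of a clock source that normalises when time skips backwards can be sketched as follows. This is an assumption-laden illustration in Python rather than the patch CloudFlare shipped: the class name `NonDecreasingClock` and its `now` method are hypothetical, and the approach shown is simply to remember the latest reading and never return anything earlier.

```python
import time
from typing import Callable


class NonDecreasingClock:
    """Wraps a time source so that successive readings never go backwards."""

    def __init__(self, source: Callable[[], float] = time.time) -> None:
        self._source = source
        self._last = source()  # take an initial reading as the baseline

    def now(self) -> float:
        reading = self._source()
        # If the underlying clock stepped backwards, keep the previous value,
        # so any interval computed from two calls is zero rather than negative.
        if reading > self._last:
            self._last = reading
        return self._last


# Usage with a scripted source (hypothetical values): 10.5 arrives after 11.0.
_scripted = iter([10.0, 11.0, 10.5, 12.0])
clock = NonDecreasingClock(source=lambda: next(_scripted))
first = clock.now()   # source returned 11.0
second = clock.now()  # source returned 10.5, held at the previous reading
third = clock.now()   # source returned 12.0
```

With a wrapper like this, the difference between two consecutive readings can be zero but never negative. For measuring intervals, Python’s standard library also offers `time.monotonic()`, which is documented as a clock that cannot go backwards.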
CloudFlare’s CTO John Graham-Cumming said: “This problem was quickly identified. The most affected machines were patched in 90 minutes and the fix was rolled out worldwide by 0645 UTC.
“We are sorry that our customers were affected by this bug and are inspecting all our code to ensure that there are no other leap second sensitive uses of time intervals.”
This is not the first time a leap second has been an issue. Back in 2012, an extra second caused technical problems for the likes of LinkedIn, FourSquare and Reddit, which led IT organisations to brace themselves for trouble ahead of the leap second added in 2015.