Microsoft Offers Refund For Azure’s Leap Year Outage

Microsoft has explained the Leap Year Bug which brought down its Windows Azure cloud service, and has promised a 33 percent credit to all customers whether or not they lost service.

Bill Laing, corporate vice president, server and cloud, at Microsoft, made the admission in a blog post which revealed that the root cause was an issue with certificates. Although the problem was understood quickly, a cascade of interactions between different operations contributed to the scale of the disruption.

Unhealthy servers

Laing explained that Azure’s servers are organised into “clusters” of about 1,000 whose health is monitored by platform software called the Fabric Controller (FC) in an effort to isolate certain classes of error.

Azure’s Platform as a Service (PaaS) functionality requires tight integration with applications that run in virtual machines (VMs), which it achieves through a guest agent (GA) that it deploys into the VMs. Each server has a host agent (HA) that the FC uses to deploy application secrets such as SSL certificates, and to communicate with the GA to see if the VM is healthy or if the FC should take recovery actions.

These secrets are encrypted, so the GA creates a transfer certificate when it initialises. New transfer certificates are issued when a new VM is created, when a deployment scales out, or when a deployment updates its VM OS. A new certificate is also issued when the FC reincarnates a VM on a healthy server if it deems the old server to be “unhealthy”.

Cluster bomb

However, when a certificate is issued, it is given a one-year validity range starting from midnight UTC. The Azure code generated the end date by adding one to the year, so any certificate issued on leap day (29 February) had an invalid end date of 29 February 2013.
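The date arithmetic described above can be illustrated with a minimal Python sketch. This is not Microsoft's code: the function names `naive_expiry` and `safe_expiry` are hypothetical, and falling back to 28 February is only one possible way to handle the missing date.

```python
from datetime import date


def naive_expiry(issued: date) -> date:
    # Mirrors the approach described in the article: add one to the year
    # and keep the month and day. Raises ValueError for 29 February,
    # because the following year has no such date.
    return issued.replace(year=issued.year + 1)


def safe_expiry(issued: date) -> date:
    # Hypothetical fix: fall back to 28 February when the anniversary
    # of a leap-day issue date does not exist.
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        return issued.replace(year=issued.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_expiry(leap_day)
    except ValueError as err:
        print("naive approach failed:", err)
    print("safe expiry:", safe_expiry(leap_day))
```

On any other issue date the two functions agree; they differ only on 29 February, which is why a bug of this kind can stay hidden for years.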

If a certificate cannot be issued, the GA restarts the VM. On leap day, of course, the same problem recurred after each restart, causing a series of restarts. If a VM restarts three times in quick succession, the HA decides there is a hardware problem and moves the server to a ‘Human Investigate’ (HI) status. Those servers were taken out of service, and the FC attempted to reincarnate any VMs hosted on them elsewhere, thus spreading the bug.
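The way the failure spread can be sketched as a toy simulation. Everything here is an illustrative assumption rather than Azure's actual logic: the `Server` class, the `simulate` function and the `cert_issuable` flag are hypothetical names, and only the three-restart limit comes from the description above.

```python
from dataclasses import dataclass

RESTART_LIMIT = 3  # three quick restarts are read as a hardware fault


@dataclass
class Server:
    name: str
    human_investigate: bool = False


def simulate(servers: list, cert_issuable: bool) -> list:
    """Move one VM from server to server and return the names of the
    servers flagged for human investigation along the way."""
    flagged = []
    for server in servers:
        restarts = 0
        # The VM restarts each time certificate creation fails.
        while not cert_issuable and restarts < RESTART_LIMIT:
            restarts += 1
        if restarts < RESTART_LIMIT:
            break  # the VM started cleanly, so nothing spreads
        # The software fault is mistaken for bad hardware; the server is
        # flagged and the VM is reincarnated on the next server.
        server.human_investigate = True
        flagged.append(server.name)
    return flagged


if __name__ == "__main__":
    cluster = [Server("s1"), Server("s2"), Server("s3")]
    print(simulate(cluster, cert_issuable=False))
```

Because the fault travels with the VM rather than the hardware, every server that receives the VM ends up flagged, which is the pattern the article describes.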

If the number of servers in this state reaches a threshold, the FC moves the entire cluster to HI status so that operators can repair it. This happened shortly after the bug was triggered at 4:00pm PST on 28 February (00:00 UTC on 29 February).

Many clusters were in the middle of the rollout of a new version of the FC, HA and GA, which meant the bug hit immediately. The bug was identified at 6:28pm PST and Microsoft disabled management functionality in all clusters by 6:55pm PST, the first time it has ever taken such action.

Improved transparency

Microsoft was able to restore service management to all clusters by 2:11am PST on 29 February, but a number of servers which were updating when the bug occurred were incompatible with the fix, meaning that another solution was required before all services were healthy by 2:15am PST on 1 March.

“We know that many of our customers were impacted by this event and we want to be transparent about what happened, what issues we found, how we plan to address these issues, and how we are learning from the incident to prevent a similar occurrence in the future,” commented Laing.

“Due to the extraordinary nature of this event, we have decided to provide a 33 percent credit to all customers of Windows Azure Compute, Access Control, Service Bus and Caching for the entire affected billing month(s) for these services, regardless of whether their service was impacted,” he added. “These credits will be applied proactively and will be reflected on a billing period subsequent to the affected billing period. Customers who have additional questions can contact support for more information.”


Steve McCaskill

Steve McCaskill is editor of TechWeekEurope and ChannelBiz. He joined as a reporter in 2011 and covers all areas of IT, with a particular interest in telecommunications, mobile and networking, along with sports technology.
