Categories: CloudCloud Management

Microsoft Has Revealed The Cause Of November’s Mass Outage

Microsoft has published a detailed analysis of what caused the mass Azure outage in November.

The outage, which started on November 18, and appeared to last for up to two days for some regions, knocked out Azure storage, websites, and other services across US and Northern Europe. Even MSN and Xbox live were down, affecting thousands of users.

Route Cause Analysis

Writing on the Azure blog, Jason Zander, the CVP for Azure, said: “On November 18, 2014, many of our Microsoft Azure customers experienced a service interruption that impacted Azure Storage and several other services, including Virtual Machines.

“Today, we’re sharing our final RCA (route cause analysis), which includes a comprehensive outline of steps we’ve taken to mitigate against this situation happening again, as well as steps we’re taking to improve our communications and support response. We sincerely apologise and recognise the significant impact this service interruption may have had on your applications and services.”

It appears that the cause was a dodgy bit of ‘flighting’, which is effectively the term of testing new updates and codes.

Zander explained: “There are two types of Azure Storage deployments: software deployments (i.e. publishing code) and configuration deployments (i.e. change settings). Both software and configuration deployments require multiple stages of validation and are incrementally deployed to the Azure infrastructure in small batches. This progressive deployment approach is called ‘flighting.’ When flights are in progress, we closely monitor health checks. As continued usage and testing demonstrates successful results, we will deploy the change to additional slices across the Azure Storage infrastructure.”

The RCA found that a change had already been flighted on a portion of the production infrastructure for several weeks.

“Unfortunately, the configuration tooling did not have adequate enforcement of this policy of incrementally deploying the change across the infrastructure,” said Zander.

To make sure it doesn’t happen again, Microsoft said it has “released an update to our deployment system tooling to enforce compliance to the above testing and flighting policies for standard updates, whether code or configuration”.

Are you a Windows expert? Take our Windows XP quiz!

Ben Sullivan

Ben covers web and technology giants such as Google, Amazon, and Microsoft and their impact on the cloud computing industry, whilst also writing about data centre players and their increasing importance in Europe. He also covers future technologies such as drones, aerospace, science, and the effect of technology on the environment.

Recent Posts

Apple Sales Rise 6 Percent After Early iPhone 16 Demand

Fourth quarter results beat Wall Street expectations, as overall sales rise 6 percent, but EU…

24 hours ago

X’s Community Notes Fails To Stem US Election Misinformation – Report

Hate speech non-profit that defeated Elon Musk's lawsuit, warns X's Community Notes is failing to…

1 day ago

Google Fined More Than World’s GDP By Russia

Good luck. Russia demands Google pay a fine worth more than the world's total GDP,…

1 day ago

Spotify, Paramount Sign Up To Use Google Cloud ARM Chips

Google Cloud signs up Spotify, Paramount Global as early customers of its first ARM-based cloud…

2 days ago

Meta Warns Of Accelerating AI Infrastructure Costs

Facebook parent Meta warns of 'significant acceleration' in expenditures on AI infrastructure as revenue, profits…

2 days ago

AI Helps Boost Microsoft Cloud Revenues By 33 Percent

Microsoft says Azure cloud revenues up 33 percent for September quarter as capital expenditures surge…

2 days ago