Typo Set In Motion Chain Of Events That Shut Down AWS S3 Cloud

While the typo may indeed have triggered the outage, by itself it should not have caused the disruption it did. Another important factor was the massive growth of the S3 storage service, which was greater than Amazon had expected.

This meant that the S3 service had not been scaled appropriately for its user load. Because it had grown larger than anticipated, a system restart took much longer than planned, which in turn kept the critical S3 systems from being restored as quickly as expected.

Essentially, S3 growth had outpaced Amazon’s ability to partition the system into smaller segments that could be restarted quickly. The storage service failures were compounded by the fact that some internal systems, such as the Service Health Dashboard, also depended on the S3 services that were down.


As a result, the dashboard was telling AWS customers the system was running normally even as their business-critical web applications crashed and became inaccessible. The typo in a single command during a debugging attempt initiated a cascading series of failures that knocked the S3 services offline for hours.

Had S3’s underlying configuration problems not existed, the typo would have been a minor occurrence, probably one that would have gone unnoticed. But that’s not what happened.

But the failure had one other effect that was equally remarkable. Amazon conducted a detailed investigation to determine what caused the outage, one that should provide valuable lessons on how the company can avoid similar failures even as its cloud service continues to grow at its current breakneck pace.

What was even more remarkable was how transparent Amazon was about the investigation and the causes of the system failure. And finally, as should be done following an investigation into a serious accident, Amazon turned its findings into a series of steps to try to ensure this specific failure would never happen again.

That final step of fixing all the things that made it possible for the accident to happen is not necessarily quick or easy. In many industries, including the airlines, where a single faulty part or a single action can cause catastrophic loss of life, problems often go unfixed for years while companies dither and regulators ponder.

It’s not just the airlines. For example, despite a number of fatal railway accidents over the past few years, Positive Train Control is still nowhere near universal on U.S. railways, even with increasingly urgent recommendations from the NTSB.

So if Amazon can claim a victory out of this very expensive cloud system crash, it’s that the company quickly determined the entire chain of events leading to the accident and immediately started fixing them all.

This is not to suggest that there will never be another Amazon Web Services outage. Any system so complex will eventually develop a new set of problems. But it’s safe to say that the same sequence of events won’t happen again.

Originally published on eWeek


Wayne Rash

Wayne Rash is senior correspondent for eWEEK and a writer with 30 years of experience. His career includes IT work for the US Air Force.
