Typo Set In Motion Chain Of Events That Shut Down AWS S3 Cloud

You hear about accident investigations on a regular basis. When an airliner goes down, a train comes off the rails or any other serious accident occurs, an investigation starts along with the grim task of recovering the dead and injured.

Usually, there will be a briefing by the investigating authority at the start and then you won’t hear anything for months. Few people know what the investigators are even looking for.

That’s because it can take months for the investigators to go through every detail before determining what caused the accident.

Inside AWS outage

The investigations are elaborate because there’s rarely a single cause to a serious accident. Eventually the investigation will show that a sequence of events occurred and it’s possible that the accident could have been prevented if any one of those events had changed.

Investigations of this type actually happen for accidents of all sorts, not just transportation catastrophes. Companies and regulators follow similar procedures for a wide variety of unplanned events.

In fact, companies will launch such an investigation when an accident causes a major loss, such as the outage that took out Amazon Web Services and its S3 storage services on February 28, which explains why the company undertook one.

I observed this first-hand in the late spring of 1971, when I was sent up a mountain near Roanoke, Virginia, to cover an airplane crash for the television station where I’d just started working. On that mountain, World War II hero and Hollywood actor Audie Murphy and five others had died as the airplane in which they were riding slammed into the top of a fog-shrouded mountain.

Around me as I climbed the side of the mountain with the rest of the news crew were representatives from the National Transportation Safety Board, already taking photos and making measurements of the crash site. Later, they would take all the components they could find of the shattered aircraft to a hangar for examination and further investigation.

Investigation

To me, as I reported from that mountainside, the reason for the crash seemed obvious. The pilot must have been lost in the fog, and failed to see the mountain. But the truth was much more complicated than that.

The investigators had to learn why the pilot had been lost like that near a major airport. Why hadn’t he performed an instrument landing at the major airport nearby after the weather had turned bad? The questions were eventually answered, and ultimately a lesson was learned.

Fortunately, not every accident results in tragic deaths. But every serious accident must be investigated to learn how it happened and how it can be prevented from happening again.

This was the case with the Feb. 28 event when Amazon Web Services’ S3 storage services shut down for hours. This time the losses were measured not in lives, but in millions of dollars lost by Amazon and its clients because of the downtime. Clearly an investigation was in order.

But as Amazon explained in a report it released on March 2 along with an apology to its customers, it was a chain of events that started with the smallest of errors, a typo in a server update command.

Originally published on eWeek


Wayne Rash

Wayne Rash is senior correspondent for eWEEK and a writer with 30 years of experience. His career includes IT work for the US Air Force.
