That big Amazon S3 outage was caused by a typo, company admits

Real computer users look at the keyboard, not at the screen.

7 Likes

In the beginning was the script

But I never blamed the Hole Hawg; I blamed myself. The Hole Hawg is dangerous because it does exactly what you tell it to. It is not bound by the physical limitations that are inherent in a cheap drill, and neither is it limited by safety interlocks that might be built into a homeowner’s product by a liability-conscious manufacturer. The danger lies not in the machine itself but in the user’s failure to envision the full consequences of the instructions he gives to it.

1 Like

rm three *

4 Likes

s3cmd del --recursive s3://*

3 Likes

‘Some people, when confronted with a problem, think “I know, I’ll use Amazon S3.” Now they have two problems.’

9 Likes

Could be worse. Could have been one of the many times a DNS typo brought down the entire Internet.

1 Like

He’s probably a Chad.

3 Likes

For the last time, that was Jeff, not me!!

3 Likes

I worked for a company that made software to monitor funds and securities. We would run a series of tests on our servers after an upgrade before putting them back online.

Being new, I didn’t realize there were test settings that had to be swapped in on the server before a run. I started the tests with the server not pointed at our dummy data services; nope, still pointed at live sources like Bloomberg.

A few scaling and load tests later, I had run up over $15,000 in data transaction fees, and concerned emails started to come in from those sources, clueing me in to the huge mistake I had just made.

This resulted in a massive panic attack in which I pondered many past mistakes leading up to this point, future career paths outside the software industry and my own mortality.

A phone call or two and some emails from my manager later, the charges were reversed, fears were assuaged, and I somehow still had a job.

So remember kids: get review before you commit.
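
For what it's worth, a pre-flight check along these lines might have caught it. This is only a rough sketch, not what that company actually ran: the `TEST_HOSTS` allowlist, the host names, and both function names are made up for illustration. The idea is to fail closed, so anything not recognized as a dummy data service stops the run.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of hosts that serve dummy data (assumed names,
# not taken from the story above).
TEST_HOSTS = {"localhost", "127.0.0.1", "dummy-data.internal.example"}


def assert_test_endpoints(endpoints):
    """Raise RuntimeError if any configured data source is not a known test host."""
    # URLs without a parseable host yield hostname None, which is not in the
    # allowlist, so malformed entries are rejected too (fail closed).
    live = [url for url in endpoints if urlparse(url).hostname not in TEST_HOSTS]
    if live:
        raise RuntimeError(f"refusing to run tests against non-test sources: {live}")


def run_load_tests(endpoints, run):
    """Check the endpoints first, then call the supplied test runner."""
    assert_test_endpoints(endpoints)
    return run()
```

Calling `run_load_tests(configured_urls, my_suite)` then refuses to start the suite if even one URL points somewhere unexpected.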

13 Likes

Bezos is smart enough to know that giving people a huge incentive to be dishonest about what went wrong is a bad idea.

The Amazon root cause analysis process doesn’t even name the person who made the mistake. I doubt that person was fired because of one error. If it was part of a pattern, well…

2 Likes

Typo, yes it was a “typo”.

7 Likes

I have actually managed to recover systems where the root user accidentally typed “rm -rf /” instead of “rm -rf ./”.

One of the reasons I never underestimate the dangers of hitting the return key without first reading the command…

[edit] The trouble of recovering it after someone else screwed up being the reason, not that I did it. Just realized that could be read differently… :smiley:
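
If the delete lives in a script rather than at a prompt, one possible guard is to resolve the path first and refuse anything that is not strictly inside a directory you expect. A minimal sketch, assuming a helper I'm calling `safe_rmtree` (hypothetical name, not from any real tool):

```python
import os
import shutil


def safe_rmtree(target, allowed_root):
    """Recursively delete target only if it lies strictly inside allowed_root."""
    root = os.path.realpath(allowed_root)
    path = os.path.realpath(target)
    # A filesystem root is its own parent; never accept it as the allowed root.
    if os.path.dirname(root) == root:
        raise ValueError("allowed_root must not be the filesystem root")
    # Reject the allowed root itself and anything that resolves outside it.
    if path == root or os.path.commonpath([path, root]) != root:
        raise ValueError(f"refusing to delete {path!r}: not inside {root!r}")
    shutil.rmtree(path)
```

With this, a slip like `safe_rmtree("/", "/srv/scratch")` raises instead of deleting anything. It does nothing for someone typing at a root shell, of course.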

5 Likes

In light of this AWS S3 outage: we, CS researchers at the Univ. of Chicago, recently published a paper, ‘Why Does the Cloud Stop Computing’. You can find our paper and slides here:

Paper: http://ucare.cs.uchicago.edu/pdf/socc16-cos.pdf
Slide: http://ucare.cs.uchicago.edu/slides/socc16-cos.pptx

Just remember

12 Likes

I have had a couple of users at my site destroy EC2 instances by running chown -R on /, apparently trying to resolve permission problems. The system was okay, but I couldn’t get back in to make repairs.

1 Like

I once corrupted a production db due to a typo, and then found out we had no backups.

I feel vindicated.

2 Likes

Every time a cloud provider approaches me pitching their service I ask for a copy of their DR plan, testing, and estimated downtime during a disaster - information which all of my clients demand from me before a contract is finalized. Oddly, not a single one has ever supplied one.

3 Likes

Did you know that an Oracle DB continues to process transactions just fine after you type “newfs” on its data partition? I didn’t know that, until I did it. Oops. Still, we moved the live traffic to another server ASAP. Sweatin’ bullets.

-jeff

You can tell he’s a serious computer operator by the turtleneck.

“Don’cha just know it?”

1 Like