That big Amazon S3 outage was caused by a typo, company admits

jerwin · March 2, 2017, 11:12pm

Real computer users look at the keyboard-- not at the screen.

stryxvaria · March 2, 2017, 11:21pm

In the beginning was the script

But I never blamed the Hole Hawg; I blamed myself. The Hole Hawg is dangerous because it does exactly what you tell it to. It is not bound by the physical limitations that are inherent in a cheap drill, and neither is it limited by safety interlocks that might be built into a homeowner’s product by a liability-conscious manufacturer. The danger lies not in the machine itself but in the user’s failure to envision the full consequences of the instructions he gives to it.

oldtaku · March 2, 2017, 11:53pm

rm three *

anon33466019 · March 2, 2017, 11:57pm

s3cmd del --recursive s3://*

OtherMichael · March 3, 2017, 12:29am

‘Some people, when confronted with a problem, think “I know, I’ll use Amazon S3.” Now they have two problems.’

ficuswhisperer · March 3, 2017, 12:32am

Could be worse. Could have been one of the many times a DNS typo brought down the entire Internet.

ficuswhisperer · March 3, 2017, 12:33am

He’s probably a Chad.

anon33466019 · March 3, 2017, 12:55am

For the last time, that was Jeff, not me!!

Gutierrez · March 3, 2017, 1:41am

I worked for a company that made software to monitor funds and securities. We would run a series of tests on our servers after an upgrade before putting them back online.

Being new I didn’t realize there were test settings that had to be swapped in on the server before a run. I started the tests with the server not pointed to our dummy data services, nope, still pointed at live sources like Bloomberg.

A few scaling and load tests later, I had run up over $15,000 in data transaction fees, and concerned emails started to come in from those sources, clueing me in to the huge mistake I had just made.

This resulted in a massive panic attack in which I pondered many past mistakes leading up to this point, future career paths outside the software industry and my own mortality.

A phone call or two and some emails from my manager later, the charges were reversed, fears were assuaged and I somehow still had a job.

So remember kids: get review before you commit.
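
A guard along these lines might have caught it before the first request went out. This is only a rough sketch: the hostnames in `TEST_HOSTS` and the function name `assert_test_endpoints` are made up for illustration, and a real setup would read the allowed hosts from its own config.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of dummy data services; not taken from any real system.
TEST_HOSTS = {"localhost", "127.0.0.1", "dummy-data.test"}


def assert_test_endpoints(endpoints):
    """Raise before a test run if any endpoint is not a known test host."""
    live = [url for url in endpoints if urlparse(url).hostname not in TEST_HOSTS]
    if live:
        raise RuntimeError(f"refusing to run tests against non-test endpoints: {live}")


# Passes silently: every endpoint points at a dummy service.
assert_test_endpoints(["http://localhost:8080/feed", "https://dummy-data.test/quotes"])
```

Calling it at the top of the test runner means a forgotten settings swap fails loudly instead of billing by the transaction.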

bobbymartin2 · March 3, 2017, 1:51am

Bezos is smart enough to know that giving people a huge incentive to be dishonest about what went wrong is a bad idea.

The Amazon root cause analysis process doesn’t even name the person who made the mistake. I doubt that person was fired because of one error. If it was part of a pattern, well…

MrHarley · March 3, 2017, 2:16am

Typo, yes it was a “typo”.

Nonentity · March 3, 2017, 3:02am

I have actually managed to recover systems where the root user accidentally typed “rm -rf /” instead of “rm -rf ./”.

One of the reasons I never underestimate the dangers of hitting the return key without first reading the command…

[edit] The trouble of recovering it after someone else screwed up being the reason, not that I did it. Just realized that could be read differently…
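
For what it's worth, a wrapper script can do the reading for you. The following is a minimal sketch, assuming a POSIX-style filesystem; `is_dangerous_target` is a hypothetical helper, not part of any real tool, and it only flags the root and its direct children.

```python
import os


def is_dangerous_target(path: str) -> bool:
    """Return True if path resolves to the filesystem root or a direct child of it."""
    resolved = os.path.realpath(path)
    parent = os.path.dirname(resolved)
    # The root resolves to itself; a top-level directory has the root as its parent.
    return resolved == os.sep or parent == os.sep


for target in ("/", "./"):
    verdict = "refuse" if is_dangerous_target(target) else "allow"
    print(f"{target!r}: {verdict}")
```

Whether “./” is allowed depends on where you are standing when you run it, which is rather the point.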

HSG_UCARE · March 3, 2017, 4:11am

In light of this AWS S3 outage: we are CS researchers at the Univ. of Chicago, and we recently published a paper titled ‘Why Does the Cloud Stop Computing’. You can find our paper and slides here:

Paper: http://ucare.cs.uchicago.edu/pdf/socc16-cos.pdf
Slides: http://ucare.cs.uchicago.edu/slides/socc16-cos.pptx

ficuswhisperer · March 3, 2017, 4:29am

Just remember

Michael_R_Smith · March 3, 2017, 9:04am

I have had a couple of users at my site destroy EC2 instances by running chown -R on /, apparently trying to resolve permission problems. The system was okay, but I couldn’t get back in to make repairs.

TripleE · March 3, 2017, 3:42pm

I once corrupted a production db due to a typo, and then found out we had no backups.

I feel vindicated.

anotherone · March 3, 2017, 9:05pm

Every time a cloud provider approaches me pitching their service I ask for a copy of their DR plan, testing, and estimated downtime during a disaster - information which all of my clients demand from me before a contract is finalized. Oddly, not a single one has ever supplied one.

jra · March 4, 2017, 2:24am

Did you know that an Oracle DB continues to process transactions just fine after you type “newfs” on its data partition? I didn’t know that, until I did it. Oops. Still, we moved the live traffic to another server ASAP. Sweatin’ bullets.

-jeff

stinkinbadgers · March 5, 2017, 1:15am

You can tell he’s a serious computer operator by the turtleneck.

TheGreatParis · March 5, 2017, 2:10am

“Don’cha just know it?”
