The "ops lessons we all learn the hard way"


Anything can be fixed by another layer of abstraction.


#52 - Schrödinger’s Backup – “The condition of any backup is unknown until a restore is attempted.” – is overly optimistic.

Unfortunately, I have had first-hand experience with this situation.


16. Very few operations are truly idempotent.

I think this may be a fundamental law of the universe. Seriously.
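It bites because "add X" style operations accumulate on retry, while "set to X" operations converge on one state. A minimal Python sketch (the function and field names are invented for illustration):

```python
# Idempotent: applying it twice leaves the same state as applying it once.
def set_owner(config: dict, owner: str) -> dict:
    config["owner"] = owner  # "set to X" converges on one state
    return config

# Not idempotent: every retry changes the state again.
def append_owner(config: dict, owner: str) -> dict:
    config.setdefault("owners", []).append(owner)  # "add X" accumulates
    return config

cfg = {}
set_owner(set_owner(cfg, "alice"), "alice")
assert cfg == {"owner": "alice"}  # retry-safe

cfg2 = {}
append_owner(append_owner(cfg2, "alice"), "alice")
assert cfg2 == {"owners": ["alice", "alice"]}  # the retry duplicated state
```

Most real operations (provision a host, send an email, charge a card) look more like the second function than the first, which is why retries are scary.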

37. Nobody knows how git works; everybody simply rm -fr && git checkout’s periodically.

I always suspected this, but was too embarrassed to find out.

38. There are very few network restrictions creative and determined use of ssh(1) port forwarding can’t overcome.

In high school, I awed many of my peers by SSH tunneling past the school’s firewall. They naively assumed I could hack anything after that, and I had to turn down more than a few requests to change folks’ grades.

46. Some of your most critical services are kept alive by a handful of people whose job description does not mention those services at all.

I’m one of those people! I need a footnote in my email sig saying that my job title is not an exhaustive description of what I do (or can do).

52. Schrödinger’s Backup – “The condition of any backup is unknown until a restore is attempted.” – is overly optimistic.

This gives me chills.

87. Simplicity is King.

And if you just plan to simplify something in the future, it’s already too late.


The severity of an incident is measured by the number of rules broken in resolving it.

This is why I disagree with people who say “we can’t make rule X; then what would happen in unlikely emergency Y?!” In an important enough emergency you’ll break the rules and deal with the consequences.


Number 38 can be picked up by L7 firewalls if you’re lazy enough not to wrap TLS around your SSH tunnels. Even then, if the firewall does SSL decryption with some degree of competency (i.e., allow-list your approved non-decryptable traffic, such as clients with pinned certs, and block everything else that won’t decrypt), you can reduce this further.

It’s not 100% blocking, but you’re challenging the one aspect that does matter: do they care enough to put the effort into getting around it? If they can’t be bothered with the hassle, chalk that up as a W and keep an eye out for traffic that behaves like SSH but doesn’t present itself as such.


Anything except complexity


Luckily for me, acquiring competent network security for my school district would have involved getting Newt Gingrich voters and John Lewis voters to both agree on a tax increase to better fund the district’s IT services. With the former being against any tax, and the latter preferring that money be spent on ensuring better outcomes for students (rather than preventing students from seeing arbitrary websites), this would have been impossible. And that is Ops Lesson #79:

Real change can only be implemented above layer 7 (in the political layer).


A coworker used to say that a backup tape is like a light bulb…you never know when it’s going to burn out.

Turns out they also used to disable failing jobs so they wouldn’t show up on the failed-jobs report but…they weren’t wrong about tapes.


“it’s all on the tapes.”

“then let’s pull it off the tapes”

sounds of 3 server admins sucking their teeth and shooting each other nervous looks


So, to sum up - two words: “Murphy’s Law”


Or when a Jr. programmer (yours truly) is let loose inside a production database and proceeds to accidentally wipe out 350 GB of historical customer data.

Frantic calls to the DBA then revealed that six months of backup tapes were non-existent.

Many Shuvs and Zuuls knew what it was to be roasted in the depths of the Slor that day, I can tell you!

  1. Serverless isn’t.

Amen for that. Somebody had to say it.


I hate tape backups with a passion. Specifically, I hate being responsible for them.

I was contracting for a guy who set up systems for smallish businesses around the Bay Area, and probably the worst was when the top guy in the company had the drive in his laptop crash. That was when we found that the tapes (which an admin was responsible for switching out) backing up their shared files were mixed up, and the most recent good one was a week old.

Once I got into programming, we had someone do the same thing you described: he ran a DELETE query with no WHERE clause. We ended up buying software that sifted through the transaction logs to recover most of the data. This is why all my hand-rolled queries now look like this:



SELECT stuff FROM some_table;

Once I’m satisfied I’m not destroying anything, I commit.
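That “look before you commit” habit can also be made mechanical: run the destructive statement inside an explicit transaction and check the affected row count before committing. A sketch using Python’s sqlite3 as a stand-in for any engine (table and values invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "stale"), (2, "stale"), (3, "active")])
conn.commit()

# Destructive change inside a transaction; nothing is permanent yet.
cur = conn.execute("DELETE FROM orders WHERE status = 'stale'")
if cur.rowcount == 2:   # the number of rows I *expected* to hit
    conn.commit()
else:
    conn.rollback()     # surprise row count: undo and go investigate

remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The expected row count is the seatbelt: a DELETE that claims to touch three million rows when you expected two never gets committed.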


SQL gives me the horrors. There should be a setting on the database to prevent expressions like this.


Over the years I’ve become a firm believer in policies that keep developers away from production anything. It’s not always possible with what the customer wants, but if I can get it, I will.

  1. If nothing else, it can serve as a bad example.

There is:

Step 0) pg_dump (or whatever the equivalent is for your sub-optimal database engine :-) )
Step 1) (do stuff)
Step 2) (realize you broke stuff)
Step 3) (restore database from the dump you made before you did steps 1 and 2)

Anything else is playing with matches at the gas station…

(Note: if your database is too big to conveniently do this, or load the whole dump into vi and edit it when needed, you have, literally, a bigger problem.)
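The same step-0 discipline works with anything that can dump to plain SQL. A sketch of the dump-break-restore cycle using sqlite3’s `iterdump` as a stand-in for pg_dump (table and data invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'acme')")
db.commit()

# Step 0: take the dump *before* touching anything.
dump = "\n".join(db.iterdump())

# Steps 1-2: do stuff, realize you broke stuff.
db.execute("DELETE FROM customers")  # oops, no WHERE clause
db.commit()

# Step 3: restore from the dump made before steps 1 and 2.
restored = sqlite3.connect(":memory:")
restored.executescript(dump)
rows = restored.execute("SELECT * FROM customers").fetchall()
```

The dump is plain SQL text, which is exactly what makes the “load it into vi and edit it” trick from the note above possible.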


So, it turned out the off-site backup tapes were blank – and not just blank, but blank–blank: unformatted. Investigation revealed that the admin had, every day after work, put the tape into his backpack, got on the bus, gone to the back of the bus, tossed his backpack onto the shelf at the back of the bus – directly above the bus engine – and traveled to the storage site to drop off the tape.

As they do, the bus engine graciously hard-erased each tape during the journey.

  1. Not every problem has a solution.