Rogers outage in Canada. Demonstration of a non-resilient system

Rogers (one of the major Internet providers in Canada) is having major issues today. It is also demonstrating the fragility of our infrastructure.

I called Rogers, but apparently they had already turned their system off and turned it back on again.

Look, I know it’s a stressful time for them, but they didn’t need to be that rude.

In the chaos, some return to the old ways.

Having had the…pleasure…of being involved in much smaller-scale cases of “the redundancy you thought you had has just turned into complexity for the root cause analysis and fix to restore a business-critical system that The Business wants back up 10 minutes ago” (thankfully none of them were ones I designed, so I got a pat on the head for helping out rather than a pink slip), I would like to express my empathy for the techs who are no doubt having the time from hell right now.

That out of the way, I would also express my sadness and disappointment that things that are so critical (and are certainly represented as being trustworthy infrastructure) have proven to be so brittle.

Say what you want about the constant, looming threat of global thermonuclear annihilation, but at least it did inspire some thinking about network resilience and redundancy among team ARPANET that today’s focus on cost optimization just doesn’t hammer home in quite the same way.

I’d be interested to know how many of the ‘old’ systems are actually old systems far enough back toward HQ to be useful (probably at least some, since surviving payphones most likely survive due to neglect rather than ongoing upgrades), vs. how many look like they stepped out of another era but are actually maybe 100 meters of copper back to a utility box with a fiber converter that has 10 minutes of battery backup, or just an LTE-to-copper converter box that lets you use your beloved rotary phone just like the old days but will fall over and die exactly as fast as your cellphone, because it is one.

I picked Verizon’s example specifically because, after Hurricane Sandy destroyed a bunch of copper infrastructure, they pushed to just not replace it and simply offer fixed-location cellular to anyone looking for classic PSTN in that area, with the attendant reduction in capital costs and change in regulatory framework (mostly in their favor).

Obviously that anecdote isn’t directly relevant to the status of these payphones, but I suspect that the same incentives are at play: telcos are happy enough to milk legacy copper, but they don’t much like maintaining it, and (depending on their spectrum ownership and competitive situation) generally prefer either to ditch it entirely and just sell cell service, or to upgrade it to fiber (absolutely better performance, but typically a major regression in battery backup versus old-school copper and giant telco office battery farms).

Canadian ISP Rogers falls over for hours, takes out broadband, cable, cellphones

A good observation:

On the other hand, there were a couple of days of quiet when Russia blocked western social media.

I went and had coffee on Friday at a cafe, after the peak chaos. The manager was going through the point-of-sale records with the staff, and it was a mix of “they will be back later with cash” and “they gave up and left” and “they ended up having enough cash”. It sounded like they were not as magnanimous with extending credit as the two above and were limiting it to regulars. However, the accounting for that morning sounds like it will involve a bunch of hand waving and saying “close enough”.

For those who did not see: the blame for the outage is being placed on maintenance… again (it happened 15 months ago as well).

I was on a road trip that day and it was chaos. You couldn’t buy anything on debit, ATMs were down, and a lot of gas pumps were down. I don’t think any of us realized how much back-end financial infrastructure depended on the Rogers cell network. Luckily credit cards are on a different system and those still worked everywhere.

CBC One did a call-in show about the issue that was insightful. IT infrastructure experts called in to say that the issue was that Rogers didn’t have a rollback plan for the huge upgrade they were doing. That’s IT 101: before doing an upgrade (especially a big one), make sure you can roll back instantly to the old system if something goes wrong. This is what Rogers seemingly failed to do.
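For anyone who hasn’t seen what “have a rollback plan” looks like in miniature, here is a rough sketch. It is only an illustration: `apply_with_rollback`, the `upgrade` and `health_check` callbacks, and the dict-shaped config are all made-up names for this example, and nothing here reflects how Rogers actually manages its network.

```python
import copy


def apply_with_rollback(config, upgrade, health_check):
    """Try `upgrade` on a copy of `config`; keep the result only if it is healthy.

    Returns (resulting_config, succeeded). The caller's `config` is never mutated.
    All names here are hypothetical and exist only for illustration.
    """
    snapshot = copy.deepcopy(config)   # known-good state to fall back to
    candidate = copy.deepcopy(config)  # the upgrade only ever touches this copy
    try:
        upgrade(candidate)
        if not health_check(candidate):
            raise RuntimeError("health check failed after upgrade")
    except Exception:
        # Anything going wrong means we hand back the untouched snapshot.
        return snapshot, False
    return candidate, True


if __name__ == "__main__":
    current = {"routes": ["a", "b"], "route_filter": True}

    def risky_upgrade(cfg):
        # Hypothetical bad change: drops a safety filter.
        cfg["route_filter"] = False

    def filter_still_on(cfg):
        return cfg["route_filter"]

    result, ok = apply_with_rollback(current, risky_upgrade, filter_still_on)
    print(ok, result)
```

The point is only that the known-good state is captured before anything changes, and that falling back to it needs no decisions under pressure. A real network is obviously far harder to snapshot than a dict.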

The other really good point raised is that this should put a bullet in the attempted merger of Rogers and Shaw. The infrastructure is clearly already too concentrated. It’ll get even more fragile and expensive if we let the big three telcos consolidate further.

I guessed an Update From Hell as the most likely. I think also that they’ve centralized the access to the border routers so that a surviving segment can’t skip over to the outside Internet without going through their main (dead) hub. There might be security reasons for that, but it seems more like business logic beating out good technical sense.
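To make the “everything goes through the main hub” guess concrete, here is a toy reachability check. The topology is entirely invented (the node names `east`, `west`, `hub`, and `internet` are placeholders, not Rogers’ real layout); it only shows how one extra border link changes what survives when the hub dies.

```python
from collections import deque


def reachable(links, start, goal, dead=frozenset()):
    """Breadth-first search over undirected `links`, skipping any `dead` nodes."""
    if start in dead or goal in dead:
        return False
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen and neighbour not in dead:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


# Hypothetical topologies: every segment reaches the outside only via the hub...
centralized = [("east", "hub"), ("west", "hub"), ("hub", "internet")]
# ...versus the same network where one segment also has its own border link.
with_bypass = centralized + [("east", "internet")]

if __name__ == "__main__":
    for name, topology in (("centralized", centralized), ("with_bypass", with_bypass)):
        print(name, reachable(topology, "east", "internet", dead={"hub"}))
```

In the centralized layout, losing the hub cuts off every segment; with the bypass, `east` stays up while `west` still goes dark.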

Somewhat unlikely. I mean, that never happens. Anywhere. Ever.
