Microsoft 365 outage caused problems for Outlook.com and Microsoft Teams users

Originally published at: https://boingboing.net/2020/09/28/microsoft-365-outage-caused-problems-for-outlook-com-and-microsoft-teams-users.html

It was more than that: it affected Azure AD auth, which is used by most of our applications for our thousands of users. They were all useless if we needed to reauth.

So wonderful of our IT staff to farm these features out instead of keeping them in-house, since Microsoft would clearly not make middle-of-the-day modifications to their infrastructure that risk everything going offline, then fail to roll them back effectively.

I say the same thing about online productivity software. I can work without email, but if I need connectivity beyond the local network to access or store data, that’s a problem waiting to happen.

Roll them back? I haven’t heard that phrase since my days as Queen of Quality Implementations. I was deposed by forces in favor of the Frequently Failing Forward Fix. :nerd_face: :wink:

In a delightful sprig of irony, authentication failures managed to knock out my access to the service health interface fast enough that I was unable to confirm that the problems people were bringing to my door were definitely on the remote end.

I continue to be baffled by why they insist on permission-gating advisories so that only people with administrative privilege for a given product can see them. There’s zero obvious gain vs. just posting it on a page with RSS like normal people.

All quotations from MO222965 “Service Degradation”:

6:09 PM: “We’ve identified and are reverting a recent change to the service which may be causing or contributing to impact.”

(Could you be less specific? And thanks for a change that must have come somewhere in the 4-6 region where definitely nobody does any work; and especially nobody needs to re-authenticate after moving from office to home.)

6:47 PM: “We’ve completed reverting the change which was likely causing impact and are monitoring the environment to ensure that services are recovering.”

6:56 PM: “We’ve identified that reverting the recent change did not alleviate impact to Microsoft services as expected. We’re working to explore additional options for mitigation.”

(Well, I take back what I said about your idea of a maintenance window, I guess; pity your level of insight into the problem has declined over the past hour…)

8:00 PM: “We’ve determined that a specific portion of our infrastructure is not processing authentication requests in a timely manner. We’re pursuing mitigation steps for this issue. In parallel, we’re rerouting traffic to alternate systems to provide further relief to the affected users.”

(Am I to take it that load balancing is an exigent measure rather than a commonplace? Or wonder why it’s two hours in and “some of the servers are timing out, let’s use different ones” just came up?)

8:54 PM: “Our mitigation strategy was successful in allowing users to sign into the previously impacted services. Our internal monitoring has validated this recovery and we have received positive confirmation from customer reports.”

(“Hey, um, guys…All our feedback forms are only available to authenticated users; should we be worried about sample bias?” “Shut up Stats 101; in this house we use Big Data and Machine Learning and Azure Cognitive. Not a problem.”)

10:02 PM: “We have confirmed via our monitoring that the majority of services have recovered for most customers. However, we continue to see a small subset of customers whose tenants are located in North America region who are still impacted. We’re now investigating mitigation steps for those customers who are still affected.”

(Is a member of a small subset. Sighs wearily)

11:20 PM: “Preliminary root cause: A portion of our infrastructure was not processing authentication requests in a timely manner and preventing users from being able to access multiple M365 services.”

(That’s what an Epiphyte would call a root-cause; and I guess it’s a cause with roots; just one growing on top of an entire forest… Well, on to Tuesday…)

A wise fellow once noted that, contrary to popular belief, everyone has a test environment. It is people who have distinct production environments that are of note.

Well that’s Office 364 for you.

Azure AD auth is great…when it doesn’t shit the bed of course.

This topic was automatically closed after 5 days. New replies are no longer allowed.