Originally published at: Enjoy this gallery of memes celebrating Facebook and Instagram outages | Boing Boing
…
Seems like just a convenient coincidence that F@ckbook, Insta/scam & WhatsApp all took a dump the day after a whistleblower dropped a bomb on Zucky. Just a coincidence…
The sweet sound of silence.
But all that valuable Covid-19 research has come to a standstill!
But the upside is that the bathroom stalls are empty and available for what they were intended.
Well, BGP can be a bear to get right, and it will really screw up your routing if you forget to export your routes, accidentally blackhole all your traffic, flap your routes to your neighbors, or just get the MSS wrong. Add in the complexities of modern virtual networking, distributed datacenter architectures, and automated push updates, and you could be looking at some real downtime. Or maybe someone unplugged a router to plug in a vacuum cleaner again, and it’ll all be back up when they’re done cleaning the data center.
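If you want to see what that looks like from the outside, here’s a minimal Python sketch of the usual triage probe: a plain TCP connect with a short timeout. The host and port below are placeholders (a documentation-range IP, not any of Facebook’s real addresses); the point is just that blackholed routes show up as a silent hang, while a reachable host with nothing listening answers quickly with a refusal.

```python
# Minimal reachability triage sketch. HOST/PORT are placeholders, not
# Facebook's real infrastructure; swap in whatever you're actually debugging.
import socket

HOST = "203.0.113.10"   # placeholder (TEST-NET-3 documentation range)
PORT = 443
TIMEOUT = 3             # seconds; blackholed traffic just hangs, so keep this short

try:
    with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
        print("TCP connect succeeded: routing and the service look fine")
except socket.timeout:
    # No SYN-ACK and no ICMP error: consistent with withdrawn/blackholed routes
    print("Connection timed out: packets are probably being dropped en route")
except ConnectionRefusedError:
    # The packet got there and a RST came back: routing is fine, the service isn't
    print("Connection refused: host reachable, nothing listening")
except OSError as exc:
    # e.g. 'No route to host' / 'Network is unreachable' from the local stack
    print(f"Network error: {exc}")
```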
Obligatory
The only downside to Facebook being down is that I can’t gloat about it on Facebook.
That was my first thought, and about the only explanation that makes sense to me.
nslookup & dig mx give no results, so traceroute & ping won’t work either.
Nothing looks amiss in whois, so the registration hasn’t expired.
Twitter seems to be having problems too now.
Unexpected consequences: https://twitter.com/ay_meshkov/status/1445105673327587342
We just had a serious outage of AdGuard DNS, but it was actually caused by Facebook. What happened, and how on earth could AdGuard depend on FB? Let me try to explain.
Everything started with Facebook’s name servers going down today. AdGuard DNS connects to them in order to find out the addresses of Facebook domains. So once they went down, AdGuard DNS started responding with an error to every request for FB domains’ addresses.
This caused a considerable spike in the overall number of requests. What happened? Every app, every device was now repeatedly requesting FB domains, as if they couldn’t live without them.
The high number of requests is not much of a problem for us; we’re ready for higher load, so this went almost unnoticed. Everything was working well until one crucial moment, when Facebook engineers decided to null-route their nameservers.
What does this mean? From that point on, requests to FB name servers didn’t just fail, they TIMED OUT. Now we couldn’t respond quickly with an error; we had to wait a few seconds until we were sure there would be no response.
The worst part is that we weren’t doing any negative caching. That means that when we couldn’t resolve a domain, we kept trying again and again until it finally succeeded (it never did), instead of caching the negative result for at least a few seconds.
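For anyone wondering what “negative caching” actually means in code, here’s a toy sketch (my own illustration, not AdGuard’s implementation): remember that a name failed to resolve and serve that failure from memory for a short TTL, instead of re-asking a dead upstream and eating the full timeout on every single query. RFC 2308 describes the proper DNS version of this, and even allows caching upstream server failures for up to five minutes.

```python
# Toy negative cache (illustration only): remember that a lookup failed and
# answer from memory for a short while, instead of re-asking a dead upstream
# and paying the full timeout on every query.
import time

NEGATIVE_TTL = 5.0                        # seconds to remember a failure
_negative_cache: dict[str, float] = {}    # name -> expiry timestamp

class ResolveError(Exception):
    pass

def resolve(name: str, upstream_lookup) -> str:
    """upstream_lookup(name) is a placeholder callback: returns an address
    or raises ResolveError."""
    now = time.monotonic()
    expiry = _negative_cache.get(name)
    if expiry is not None and now < expiry:
        # Fail fast from the cache; no slow upstream round-trip.
        raise ResolveError(f"{name}: cached negative result")
    try:
        return upstream_lookup(name)
    except ResolveError:
        _negative_cache[name] = now + NEGATIVE_TTL
        raise
```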
So we had an overwhelming number of incoming queries that timed out and simply exhausted the servers’ resources. This all led to one of the worst outages we’ve ever had with AG DNS.
At some point we almost hit 1M queries per second (our normal load is about 250-300k). Most of the queries are encrypted (DoT/DoH/DoQ), so this is like 10x the regular DNS load.
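Some back-of-the-envelope math on why timeouts hurt so much more than fast errors (the per-query durations are my assumptions for illustration, not AdGuard’s measured numbers; the qps figures are from the thread): by Little’s law, the number of queries held open at once is the arrival rate times how long each one takes.

```python
# In-flight queries = arrival rate x time each query is held open (Little's law).
# The per-query durations below are assumptions for illustration; the qps
# figures come from the thread above.
normal_qps, outage_qps = 300_000, 1_000_000
fast_error_s = 0.05     # assumed: a quick SERVFAIL-style answer
timeout_s = 5.0         # assumed: waiting out a null-routed upstream

print("normal load, fast answers:", int(normal_qps * fast_error_s), "in flight")
print("outage load, fast answers:", int(outage_qps * fast_error_s), "in flight")
print("outage load, timing out:  ", int(outage_qps * timeout_s), "in flight")
# ~15k -> ~50k -> ~5,000,000 concurrent queries: that last jump is what
# exhausts the servers.
```

Going from tens of thousands to millions of concurrent in-flight queries is what “exhaust the servers’ resources” looks like in practice, which is why answering the repeat failures from a negative cache brings things back under control.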
It took us about an hour to figure all that out, implement a fix (negative caching it is) and deploy it to every AG DNS server. Everything works well now, but we learned a couple of very useful lessons. Thanks, FB!