AI crawlers run up a website's hosting bill

Originally published at: https://boingboing.net/2024/07/29/ai-crawlers-run-up-a-websites-hosting-bill.html

5 Likes

Block all the sites. There’s nothing to be gained from being ingested by an LLM. They just plagiarise you without credit at best.

What’s worse than that? Crediting you next to some absolute bullshit it pulls out of its incontinent arse.

17 Likes

As Rob says, some crawlers are necessary for search engine indexing. We need a better version of robots.txt to exclude unauthorised LLM crawlers.
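In the meantime, the best plain robots.txt can do is name the AI crawlers you already know about. A minimal example using the agent names the major vendors have published (verify against each vendor’s current documentation, since these names change):

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Note that Google-Extended only controls AI training use; normal Googlebot indexing is governed by its own separate rules.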

13 Likes

There are a lot more web crawlers these days than there used to be.

11 Likes

It’s still up to the user-agent to respect robots.txt (or not). And given the ethical track record of AI companies, do we really expect them to heed any limitations specified in the file?
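For what it’s worth, this is all a well-behaved client has to do: fetch robots.txt and honour it. A sketch with Python’s stdlib parser (the GPTBot rules here are just an illustrative example, not any site’s real file):

```python
from urllib.robotparser import RobotFileParser

# Parse an in-memory robots.txt that blocks one named bot
# but allows everyone else.
rp = RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
    "User-agent: *",
    "Allow: /",
])

# A compliant crawler calls can_fetch() before every request.
print(rp.can_fetch("GPTBot", "https://example.com/page"))
print(rp.can_fetch("FriendlyBot", "https://example.com/page"))
```

The point of the thread stands: nothing enforces this check. A crawler that skips it sees no error, no block, nothing.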

16 Likes

Easier said than done.

  • You can use robots.txt, but only well-behaved robots will respect that
  • You can block IP ranges, which will only work with crawlers you know of, for a while.
  • You can block domains. Same problem.

What I’ve done is set pretty strict throttles on traffic. If you hit more than 60 pages in a minute, you’re kicked out for 6 hours.
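A minimal in-memory sketch of that throttle, using the numbers from the post (60 pages per minute, 6-hour ban); the data structures are my own assumption, not the poster’s actual setup, and a real deployment would do this at the proxy layer:

```python
import time
from collections import defaultdict, deque

WINDOW = 60            # sliding window, seconds
MAX_HITS = 60          # pages allowed per window
BAN_SECONDS = 6 * 3600 # kicked out for 6 hours

hits = defaultdict(deque)  # ip -> timestamps of recent requests
banned_until = {}          # ip -> time the ban lifts

def allow(ip, now=None):
    """Return True if this request is allowed, banning on overflow."""
    now = time.time() if now is None else now
    if banned_until.get(ip, 0) > now:
        return False
    q = hits[ip]
    q.append(now)
    # Drop timestamps that have aged out of the window.
    while q and q[0] <= now - WINDOW:
        q.popleft()
    if len(q) > MAX_HITS:
        banned_until[ip] = now + BAN_SECONDS
        return False
    return True
```

Per-IP throttling still has the same weakness as IP blocking, of course: a scraper spread across enough addresses stays under any per-address limit.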

26 Likes

I never thought I’d miss the days when crypto bros were the loudest assholes in tech…

10 Likes

AI crawlers are acting in a way that is not respectful to the sites they are crawling,

Quelle surprise!

6 Likes

The report pretends that AI companies aren’t deliberately morphing their crawler names to get around blocks.

9 Likes

[shocked Philip J. Fry GIF]

1 Like

Looks like robots.txt is being circumvented left, right, and centre as AI companies develop new crawlers (which they’re totally not doing to get around robots.txt, you understand [wink]).

In Anthropic’s case, the robots.txt files of some popular websites, including Reuters.com and the Condé Nast family of websites, are blocking two AI scraper bots called “ANTHROPIC-AI” and “CLAUDE-WEB,” which are bots that were once owned by Anthropic and used by its Claude AI chatbot. But Anthropic’s current and active crawler is called “CLAUDEBOT.” Neither Reuters nor Condé Nast, for example, blocks CLAUDEBOT. This means that these websites—and hundreds of others who have copy pasted old blocker lists—are not actually blocking Anthropic.
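Going by the agent names in that report, a robots.txt that covers both the stale entries and the currently active crawler would look like this (the names are taken from the quote above; check Anthropic’s own documentation for whatever it is calling the bot this week):

```
User-agent: ANTHROPIC-AI
Disallow: /

User-agent: CLAUDE-WEB
Disallow: /

User-agent: ClaudeBot
Disallow: /
```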

5 Likes

No: hundreds of sites put current Anthropic scrapers on their blocklist, and then Anthropic changed its agent string. :roll_eyes:

It would be interesting to survey major sites’ robots.txt to see if there’s a particular blocking threshold that triggers Anthropic’s morphs.

6 Likes

I’d be surprised if you can opt out of Google’s LLM without also opting out of its search index.

4 Likes

Seems like recognising and blocking AI web scrapers is something a machine learning algorithm could do very effectively, particularly if a bunch of websites shared their training data…

1 Like

I get a toggle-flipper scraper that uses a random agent string (Win95!), and comes in from a whole pile of IP addresses. (I should check if it’s using TOR.) It gives itself away by always trying port 80 first, HTTP/1.1…

I plan to spend some time analyzing it, but that’ll have to wait for the other side of moving day in a month.
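That “always tries port 80 first, HTTP/1.1” tell is easy to look for in logs. A sketch, assuming a log format that actually records the destination port and protocol (the default combined log format does not; the format here is my own invention for illustration):

```python
import re
from collections import OrderedDict

# Assumed line format: "<ip> <port> <method> <path> <protocol>"
LOG_RE = re.compile(r"^(\S+) (\d+) (\S+) (\S+) (HTTP/[\d.]+)$")

def first_contact_flags(lines):
    """Flag IPs whose *first* recorded request was plain HTTP/1.1 on port 80."""
    seen = set()
    flagged = set()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that don't match the assumed format
        ip, port, _method, _path, proto = m.groups()
        if ip in seen:
            continue  # only the first contact per IP matters
        seen.add(ip)
        if port == "80" and proto == "HTTP/1.1":
            flagged.add(ip)
    return flagged
```

One behavioural fingerprint like this survives agent-string rotation and IP churn, which is exactly why it gives the scraper away.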

3 Likes

I use Cloudflare for some of my domains and they’ve introduced an option to block AI bots. I’ve not tried it yet because I haven’t had any issues with AI crawlers yet but I wouldn’t mind blocking them regardless.

2 Likes

It was enabled for sites belonging to all of my clients within a week of the announcement. Putting aside potential bandwidth costs, no-one wants their content stolen.

3 Likes

At least that’s a good thing about TOR - public exit nodes are public knowledge. “Point at this url and update this ban list” is easy sauce.
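A sketch of that “point at this URL and update the ban list” step, assuming the Tor Project’s published bulk exit list endpoint and its one-IP-per-line format (both real at the time of writing, but worth re-checking):

```python
import urllib.request

# Published list of current Tor exit-node IPs, one per line.
EXIT_LIST_URL = "https://check.torproject.org/torbulkexitlist"

def parse_exit_list(text):
    """One IP per line; ignore blanks and comment lines."""
    return {
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    }

def fetch_exit_nodes(url=EXIT_LIST_URL):
    """Fetch and parse the current exit-node list."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_exit_list(resp.read().decode("utf-8"))
```

Run it from cron and feed the result into your firewall or server deny list.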

Good luck with the move.

4 Likes

It’s too bad we can’t give robots.txt the force of law, or even the force of a terms-of-service agreement.

3 Likes