AI crawlers run up a website's hosting bill

Originally published at: https://boingboing.net/2024/07/29/ai-crawlers-run-up-a-websites-hosting-bill.html

5 Likes

Block all the sites. There’s nothing to be gained from being ingested by an LLM. They just plagiarise you without credit at best.

What’s worse than that? Crediting you next to some absolute bullshit it pulls out of its incontinent arse.

17 Likes

As Rob says, some crawlers are necessary for search engine indexing. We need a better version of robots.txt to exclude unauthorised LLM crawlers.
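In the meantime, the best plain robots.txt can do is name the AI crawlers you already know about. A minimal example using the agent names the major vendors have published (verify against each vendor’s current documentation, since these names change):

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Note that Google-Extended only controls AI training use; normal Googlebot indexing is governed by its own separate rules.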

13 Likes

There are a lot more web crawlers these days than there used to be.

11 Likes

It’s still up to the user-agent to respect robots.txt (or not). And given the ethical track record of AI companies, do we really expect them to heed any limitations specified in the file?
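For what it’s worth, this is all a well-behaved client has to do: fetch robots.txt and honour it. A sketch with Python’s stdlib parser (the GPTBot rules here are just an illustrative example, not any site’s real file):

```python
from urllib.robotparser import RobotFileParser

# Parse an in-memory robots.txt that blocks one named bot
# but allows everyone else.
rp = RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
    "User-agent: *",
    "Allow: /",
])

# A compliant crawler calls can_fetch() before every request.
print(rp.can_fetch("GPTBot", "https://example.com/page"))
print(rp.can_fetch("FriendlyBot", "https://example.com/page"))
```

The point of the thread stands: nothing enforces this check. A crawler that skips it sees no error, no block, nothing.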

16 Likes

Easier said than done.

  • You can use robots.txt, but only well-behaved robots will respect that
  • You can block IP ranges, which will only work with crawlers you know of, for a while.
  • You can block domains. Same problem.

What I’ve done is set pretty strict throttles on traffic. If you hit more than 60 pages in a minute, you’re kicked out for 6 hours.
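A minimal in-memory sketch of that throttle, using the numbers from the post (60 pages per minute, 6-hour ban); the data structures are my own assumption, not the poster’s actual setup, and a real deployment would do this at the proxy layer:

```python
import time
from collections import defaultdict, deque

WINDOW = 60            # sliding window, seconds
MAX_HITS = 60          # pages allowed per window
BAN_SECONDS = 6 * 3600 # kicked out for 6 hours

hits = defaultdict(deque)  # ip -> timestamps of recent requests
banned_until = {}          # ip -> time the ban lifts

def allow(ip, now=None):
    """Return True if this request is allowed, banning on overflow."""
    now = time.time() if now is None else now
    if banned_until.get(ip, 0) > now:
        return False
    q = hits[ip]
    q.append(now)
    # Drop timestamps that have aged out of the window.
    while q and q[0] <= now - WINDOW:
        q.popleft()
    if len(q) > MAX_HITS:
        banned_until[ip] = now + BAN_SECONDS
        return False
    return True
```

Per-IP throttling still has the same weakness as IP blocking, of course: a scraper spread across enough addresses stays under any per-address limit.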

26 Likes

I never thought I’d miss the days when crypto bros were the loudest assholes in tech…

10 Likes

AI crawlers are acting in a way that is not respectful to the sites they are crawling,

Quelle surprise!

6 Likes

The report pretends that AI companies aren’t deliberately morphing their crawler names to get around blocks.

9 Likes

[shocked Philip J. Fry GIF]

1 Like

Looks like robots.txt is being circumvented left, right, and centre as AI companies develop new crawlers (which they’re totally not doing to get around robots.txt, you understand [wink]).

In Anthropic’s case, the robots.txt files of some popular websites, including Reuters.com and the Condé Nast family of websites, are blocking two AI scraper bots called “ANTHROPIC-AI” and “CLAUDE-WEB,” which are bots that were once owned by Anthropic and used by its Claude AI chatbot. But Anthropic’s current and active crawler is called “CLAUDEBOT.” Neither Reuters nor Condé Nast, for example, blocks CLAUDEBOT. This means that these websites—and hundreds of others who have copy pasted old blocker lists—are not actually blocking Anthropic.
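Going by the agent names in that report, a robots.txt that covers both the stale entries and the currently active crawler would look like this (the names are taken from the quote above; check Anthropic’s own documentation for whatever it is calling the bot this week):

```
User-agent: ANTHROPIC-AI
Disallow: /

User-agent: CLAUDE-WEB
Disallow: /

User-agent: ClaudeBot
Disallow: /
```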

5 Likes

No: hundreds of sites put current Anthropic scrapers on their blocklist, and then Anthropic changed its agent string. :roll_eyes:

It would be interesting to survey major sites’ robots.txt to see if there’s a particular blocking threshold that triggers Anthropic’s morphs.

6 Likes

I’d be surprised if you can opt out of Google’s LLM without also opting out of its search index.

4 Likes

Seems like recognising and blocking AI web scrapers is something a machine learning algorithm could do very effectively, particularly if a bunch of websites shared their training data…

1 Like

I get a toggle-flipper scraper that uses a random agent string (Win95!), and comes in from a whole pile of IP addresses. (I should check if it’s using TOR.) It gives itself away by always trying port 80 first, HTTP/1.1…

I plan to spend some time analyzing it, but that’ll have to wait for the other side of moving day in a month.
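That “always tries port 80 first, HTTP/1.1” tell is easy to look for in logs. A sketch, assuming a log format that actually records the destination port and protocol (the default combined log format does not; the format here is my own invention for illustration):

```python
import re
from collections import OrderedDict

# Assumed line format: "<ip> <port> <method> <path> <protocol>"
LOG_RE = re.compile(r"^(\S+) (\d+) (\S+) (\S+) (HTTP/[\d.]+)$")

def first_contact_flags(lines):
    """Flag IPs whose *first* recorded request was plain HTTP/1.1 on port 80."""
    seen = set()
    flagged = set()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that don't match the assumed format
        ip, port, _method, _path, proto = m.groups()
        if ip in seen:
            continue  # only the first contact per IP matters
        seen.add(ip)
        if port == "80" and proto == "HTTP/1.1":
            flagged.add(ip)
    return flagged
```

One behavioural fingerprint like this survives agent-string rotation and IP churn, which is exactly why it gives the scraper away.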

3 Likes

I use Cloudflare for some of my domains and they’ve introduced an option to block AI bots. I’ve not tried it yet because I haven’t had any issues with AI crawlers yet but I wouldn’t mind blocking them regardless.

2 Likes

It was enabled for sites belonging to all of my clients within a week of the announcement. Putting aside potential bandwidth costs, no-one wants their content stolen.

3 Likes

At least that’s a good thing about TOR - public exit nodes are public knowledge. “Point at this url and update this ban list” is easy sauce.
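A sketch of that “point at this URL and update the ban list” step, assuming the Tor Project’s published bulk exit list endpoint and its one-IP-per-line format (both real at the time of writing, but worth re-checking):

```python
import urllib.request

# Published list of current Tor exit-node IPs, one per line.
EXIT_LIST_URL = "https://check.torproject.org/torbulkexitlist"

def parse_exit_list(text):
    """One IP per line; ignore blanks and comment lines."""
    return {
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    }

def fetch_exit_nodes(url=EXIT_LIST_URL):
    """Fetch and parse the current exit-node list."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_exit_list(resp.read().decode("utf-8"))
```

Run it from cron and feed the result into your firewall or server deny list.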

Good luck with the move.

4 Likes

It’s too bad we can’t give robots.txt the force of law, or even the force of a terms-of-service agreement.

3 Likes