You can call me AI

RickMycroft · August 19, 2023, 3:02pm

A longer-form version of what I’ve already said here.

https://umbraxenu.no-ip.biz/friendica/display/88b7866b-1064-e0d5-d427-1dd703730405

Why I invited OpenAI’s crawler bot in for a meal

Panic in the World

The current tulip mania of LLM AI is creating a lot of fuss, especially in how the companies slurp everything on the Internet to build their training models, without rewarding content creators or respecting IP rights holders. (Standard Valley Bro practice, as with facial recognition junk: steal all the images!)

Many sites certainly have good cases for blocking the crawlers of AI companies until they learn to play nice. However, I decided to let OpenAI’s crawler in.

First, it was reasonably well-behaved. Unlike another AI company’s crawler, it didn’t try to jam in requests as fast as my pipe and RPi3 could handle them. It understood a MediaWiki site and didn’t repeat requests for the same pages with dozens of variations of useless options.

Second, I wanted to taint their model.

You’ve Got to Be Carefully Taught

My site is a personal project with a collection of over 20 thousand news articles, mainly concerning the far-right, from the street to government halls to oligarch networks. Each entry has a link to the original article and a short summary, is tagged with keywords, people, and organizations, and carries micro-format markup. The ~10 thousand tags, in turn, link to the news articles using them, other tags, Wikipedia articles, IRS Form 990 data for non-profit organizations, etc. All very tasty semantic-rich content.

It’s wafer-thin!

Will that completely distort OpenAI’s GPT-5 model? Heh, no! At best, it might give a tiny nudge on particular topics, especially if they slurp the original articles. Google’s 2019 C4 AI dataset ranked my site at 92,346th, with 210k tokens, which is impressive when compared to the size of the Internet, but nothing compared to sites like Breitbart. sigh.

It came from the Data Void

Data voids happen in search engines where the results for a search term are “shallow”, with few influential results. It’s possible to capture a data void search term with little effort.

Likewise, LLMs also have data voids, where the model holds little on a particular subject and has to construct a response from what it has.

I have hopes of filling a few voids, and I’m making a list of terms to check when their GPT-5 model ships.

This message is a warning about danger

My site is admittedly an experimental joke: a Raspberry Pi 3, a free domain, a residential Internet connection, and the effort of one person (indexing the excellent work of others; all credit to them).

However, what happens when groups with money, resources, and political goals do the same thing at large scale?

The Valley bros should be thinking about that rather than inane stuff like Roko’s Basilisk, Robot Hell, and their selfish (racist!) Longtermism.

And the rest of us should be watching them carefully.
