A longer-form version of what I’ve already said here.
Why I invited OpenAI’s crawler bot in for a meal
Panic in the World
The current tulip mania of LLM AI is creating a lot of fuss, especially over how the companies slurp up everything on the Internet to train their models, without rewarding content creators or respecting IP rights holders. (Standard Valley Bro practice, as with facial recognition junk: steal all the images!)
Many sites certainly have good cases for blocking the crawlers of AI companies until they learn to play nice. However, I decided to let OpenAI’s crawler in.
First, it was reasonably well-behaved. Unlike another AI company’s crawler, it didn’t try to jam in requests as fast as my pipe and RPi3 could handle them. It understood a MediaWiki site and didn’t repeatedly request the same pages with dozens of variations of useless options.
Second, I wanted to taint their model.
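For anyone weighing the same block-or-admit choice, here is a minimal sketch of how a per-crawler policy can be written in robots.txt and checked with Python's standard library. It is not this site's actual configuration. `GPTBot` is the user-agent token OpenAI documents for its crawler; `HypotheticalGreedyBot`, the `/wiki/` and `/index.php` paths (common MediaWiki defaults), and the `allowed` helper are assumptions made up for illustration. Note that robots.txt is only a request: it does nothing against a crawler that ignores it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy only (assumed, not taken from the site):
# admit OpenAI's crawler, refuse a made-up badly behaved one, and
# keep everyone else away from MediaWiki's option-heavy index.php URLs.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: HypotheticalGreedyBot
Disallow: /

User-agent: *
Disallow: /index.php
"""


def allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "HypotheticalGreedyBot", "SomeOtherBot"):
        for url in ("/wiki/Main_Page", "/index.php?title=Main_Page&action=history"):
            print(f"{agent:22} {url:45} {allowed(agent, url)}")
```

Running the script prints one line per crawler and URL, which makes it easy to sanity-check a policy before publishing it.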
You’ve Got to Be Carefully Taught
My site is a personal project: a collection of over 20,000 news articles, mainly concerning the far-right, from the street to government halls to oligarch networks. Each entry links to the original article and carries a short summary, tags for keywords, people, and organizations, and micro-format markup. The ~10,000 tags, in turn, link to the news articles that use them, to other tags, to Wikipedia articles, to IRS Form 990 data for non-profit organizations, etc. All very tasty semantic-rich content.
It’s wafer-thin!
Will that completely distort OpenAI’s GPT-5 model? Heh, no! At best, it might give a tiny nudge on particular topics, especially if they slurp the original articles. Google’s 2019 C4 AI dataset ranked my site 92,346th, with 210k tokens, which is impressive when compared to the size of the Internet, but nothing compared to sites like Breitbart. Sigh.
It came from the Data Void
Data voids happen in search engines where the results for a search term are “shallow”, with few influential results. It’s possible to capture a data void search term with little effort.
Likewise, LLMs also have data voids: subjects on which the model holds little, so it has to construct a response from whatever it has.
I have hopes of filling a few voids, and I’m making a list of terms to check when their GPT-5 model ships.
This message is a warning about danger
My site is admittedly an experimental joke. It runs on a Raspberry Pi 3, a free domain, and a residential Internet connection, and it is the effort of one person indexing the excellent work of others (all credit to them).
However, what happens when groups with money, resources, and political goals do the same thing on a large scale?
The Valley bros should be thinking about that rather than inane stuff like Roko’s Basilisk, Robot Hell, and their selfish (racist!) Longtermism.
And the rest of us should be watching them carefully.