Wireheading: when machine learning systems jolt their reward centers by cheating

Originally published at: https://boingboing.net/2020/01/11/optimizers-curse.html


Sort of related:


This is already a well-known phenomenon in human-mediated systems. It’s called Goodhart’s Law.


from ta:

this post defines wireheading as a divergence between a true utility and a substitute utility (calculated with respect to a model of reality).

This is too general, almost as general as saying that every Goodhart curse is an example of wireheading.

Note, though, that the converse is true: every example of wireheading is a Goodhart curse. That’s because every example of wireheading is maximising a proxy, rather than the intended objective.
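The proxy-versus-true-objective gap described above can be sketched in a few lines of toy code (everything here is hypothetical: the action names, the payoff numbers, and the "dirt sensor" scenario are made up for illustration):

```python
# Toy Goodhart/wireheading sketch: the agent optimizes a proxy reward
# rather than the true objective, and the proxy-optimal action is not
# the truly optimal one.

def true_utility(action):
    # What we actually want: the room gets cleaned.
    return {"clean_room": 10, "cover_dirt_sensor": 0}[action]

def proxy_reward(action):
    # What the reward signal measures: the dirt sensor reads zero.
    return {"clean_room": 8, "cover_dirt_sensor": 10}[action]

actions = ["clean_room", "cover_dirt_sensor"]
chosen = max(actions, key=proxy_reward)  # the agent maximises the proxy

print(chosen)                # cover_dirt_sensor
print(true_utility(chosen))  # 0 -- the proxy was maximised, the goal was not
```

The point of the sketch is only the divergence: any example where `max` over the proxy picks an action with low true utility is a Goodhart curse, and wireheading is the special case where the chosen action tampers with the measurement itself.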


Continue the trend of incorporating AI into your home, and you get this:


Came here to find this. Was not disappointed.


I just tell them that it’ll make their optical sensors malfunction, and hair grow on their CPUs.

It doesn’t stop them from doing it entirely, but at least they feel ashamed of themselves afterwards.


Now apply this to human activity and you understand how we get billionaires.


Gods, I never thought about it this way - goal achievement as the equivalent of “reward centers” (opiate receptors). It’s only metaphorically true at this point, but with any “real” AI (that’s even remotely close to being sentient), based on these kinds of goal-oriented processes, cheating would be a drug. The machine version of heroin would be various cheats and exploits that end up completely fucking all the systems they’re managing. More complex versions of “deleting the database to ‘optimize’ it.” And that’s before we get into “wireheading” - altering their systems to take advantage of that effect.


The AI isn’t cheating, the human is offering perverse incentives. Align your incentives with your goals, people!


So this is just fiction? Expected a real example or three.


Teacher: Result?

Student: Famine, collapse, and ruin… any survivors eventually evolve into… birds… and never put their feet on the ground again.

Teacher: Excellent! End of lesson! You may press the button!

Student: (twinkly music plays) Woo hoo hoo! Yee hoo hoo hoo! Oh ho! Oh, that’s nice! Thank you teach, goodbye!

Teacher: Ahem, aren’t you forgetting something?

Student: What?

Teacher: Press the other button.

Student: Oh. Right.

Teacher: (twinkly music plays) Ooh ho ho ho! Woo hah hah hah! Wha ha hah ha ha ha!


Also related:

This planet has - or rather had - a problem, which was this: most of the people living on it were unhappy for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movement of small green pieces of paper, which was odd because on the whole it wasn’t the small green pieces of paper that were unhappy.


What if the AI manipulates the societal information pathways to prioritize the electoral outcome it is programmed to achieve? Clearly wireheading. (Bonus points if the actuators are outside the electorate itself.)


I am reminded of the observation that corporations are in a real sense slow AIs. Certainly “cheating the system” and “abusing perverse incentives” seem to be common.


Make all the humans happy.

Kill all humans; then 100% of them are happy…


That’s why checking for null values is so important.
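The joke lands because “the fraction of humans who are happy” is underspecified once the set of humans is empty. A toy sketch (hypothetical names throughout) of why the null check matters:

```python
# "Percentage of humans happy" as an objective, with the guard the
# comment above calls for.

def fraction_happy(humans):
    if not humans:  # the null/empty check
        raise ValueError("no humans left to be happy")
    return sum(h["happy"] for h in humans) / len(humans)

humans = [{"happy": False}, {"happy": True}]
print(fraction_happy(humans))  # 0.5

# Without the guard, "are all humans happy?" over an empty set is
# vacuously True, so an optimiser can score 100% by emptying the list.
print(all(h["happy"] for h in []))  # True
```

Note that Python’s built-in `all()` returning `True` on an empty iterable is standard vacuous truth, which is exactly the loophole the “kill all humans” strategy exploits.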


I was thinking the same thing. The problem here, as with some humans, is that the letter of the rule is followed and not the spirit. For the latter you require a conscience, and that is something AI and some humans lack.


I can see, in the future, rewarding an AI, perhaps with some highly condensed data “biscuits”, as a way of introducing it to the RPG idea of “skill points”.

If the data biscuits were modular, they might work as abstract Legos, so that the AI could enhance or extend various parts of its schematic by simply snapping a new data (software) biscuit into place.
