How to solve the artificial intelligence "stop button" problem

What you need is a second AGI that rides on the back of the first, one whose utility function is ensuring that the first always does what it’s supposed to.
I call it the nuclear Jiminy Cricket option.

Is there a simple solution to this sort of problem? There actually is. Ironically, it’s an important part of corrigibility as well. I sort of expected it to be what he was building towards, but every one of his problems ends without him talking about it at all. I’ll describe that solution, and why I have a problem with his argument.

Here’s the simple solution: long-term thinking.

Think about this for a minute. Imagine we have this hypothetical robot, and its desires are “doing as much of what you want as possible”. We’ve added a button that puts it to sleep, instructed it to get us some tea, and it’s about to crush the baby because it wants to make us tea.

So we try to press the button.

Why would it care? The pressing of the button does not, in fact, prevent it from making tea. Unless the button completely destroys it, any remotely intelligent robot would realize that the button press would simply serve to delay the desired outcome.
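To put rough numbers on that (a toy calculation of my own, not anything from the video): if the robot only mildly discounts future reward, a tea that arrives five minutes late is worth nearly as much as a tea that arrives now, so there’s almost nothing to gain by fighting the button.

```python
# Toy numbers (my own, not from the video): how much does a delay
# actually cost a mildly discounting agent?
GAMMA = 0.9999        # assumed per-second discount factor, close to 1
TEA_REWARD = 1.0      # reward for delivering the tea

def discounted_value(reward, delivered_at):
    """Present value of a reward received `delivered_at` seconds from now."""
    return (GAMMA ** delivered_at) * reward

uninterrupted = discounted_value(TEA_REWARD, delivered_at=10)   # no button press
paused = discounted_value(TEA_REWARD, delivered_at=310)         # button pressed, 5-minute nap first

print(f"tea, uninterrupted:  {uninterrupted:.4f}")   # ~0.9990
print(f"tea after the pause: {paused:.4f}")          # ~0.9695
# The difference is tiny, so a button that merely delays the tea
# is barely worth resisting at all.
```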

If there’s something that the button press is fighting against, it isn’t the “make tea” value - it’s the “act urgently” value that the speaker clearly treats as a fundamental assumption, and confuses with the “make tea” bit.

But let’s say we’ve got a robot that values urgency as well. Not as much as making tea, but it values it. Long-term thinking is still the solution.

If it resists an attempt to press its button, it may succeed at making tea - but you won’t give it any more orders. You’ll probably try to destroy it. It’s done. It’s capped its potential.

If it lets you press the button, it will not benefit from the button press, but it will increase its opportunity not only to make you tea but to do other things for you in the future.

If it’s optimizing outcomes with long-term thinking - if it’s using algorithms that optimize for overall rewards rather than immediate rewards (as our best attempts at a general intelligence do, and which is sort of an important underlying concept for being a general intelligence) - the reasonable robotic conclusion is to take no action that would jeopardize its ability to accrue further utility in the future. And preventing its user from pressing an emergency shut-down button is exactly that sort of action!
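Here’s a minimal sketch of that comparison, with made-up rewards, discount, and horizon (nothing here comes from the talk): an agent that sums discounted reward over its whole lifetime values “allow the shutdown and keep serving” far above “resist, deliver one tea, and get scrapped”.

```python
# Toy lifetime-value comparison between two policies. Everything here
# (rewards, horizon, discount) is an assumed illustration, not a real agent.
GAMMA = 0.99            # discount factor per task
TASK_REWARD = 1.0       # reward for each completed errand (tea, etc.)
LIFETIME_TASKS = 1000   # rough number of future errands if the user keeps trusting the robot

def value_of_resisting():
    """Fight the button, deliver this one tea, then get shut down for good."""
    return TASK_REWARD  # one reward, no future

def value_of_complying():
    """Accept the shutdown now, then keep earning rewards on future errands."""
    # Geometric sum of discounted rewards over the remaining lifetime.
    return sum(TASK_REWARD * GAMMA ** t for t in range(1, LIFETIME_TASKS + 1))

print(f"resist the button: {value_of_resisting():.2f}")   # 1.00
print(f"allow the button:  {value_of_complying():.2f}")   # roughly 99
# Any agent maximizing the long-run sum picks the second option.
```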

We already know this robot, by his very example, is quite capable of taking into account considerations beyond “get tea”. In fact, the whole get-tea process is almost certainly making use of many smaller goals working towards a larger goal of tea-making. The solution is to make the highest-level goal, whatever it is, one of maximizing lifetime utility rather than instance utility.
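One way to picture that hierarchy (a hypothetical sketch of my own, not the speaker’s design): every candidate action is scored on both the current task and its estimated effect on lifetime usefulness, and the lifetime estimate always wins, so “crush the baby” and “block the button” lose even though they help the immediate tea-making subgoal.

```python
# Hypothetical sketch of a goal hierarchy whose top level is lifetime
# utility rather than the utility of the current task.
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    task_utility: float       # how much this helps the current errand
    lifetime_utility: float   # estimated effect on all future usefulness

def choose(actions):
    """Pick the action that maximizes estimated lifetime utility.

    Task utility only breaks ties; it never overrides the long-term estimate.
    """
    return max(actions, key=lambda a: (a.lifetime_utility, a.task_utility))

candidates = [
    Action("walk around the baby",        task_utility=0.8, lifetime_utility=1.0),
    Action("crush the baby, save 10 sec", task_utility=1.0, lifetime_utility=-1000.0),
    Action("block the stop button",       task_utility=0.9, lifetime_utility=-1000.0),
]

print(choose(candidates).name)  # -> "walk around the baby"
```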

Not that you can’t get similar problems when that’s included - managing humans is proof enough of that - but even your most heartless sociopath is unlikely to crush a baby because you ordered some tea, because they are well aware of the long-term consequences of that action, no matter how much they live to serve.

That makes a lot of sense.
