In this article I propose a solution to the most important problem humanity has ever faced or ever will face, which might already make you question my mental state. One has to be crazy to think they have solved the superalignment problem, and yet that's exactly what I am about to do.
Since even many people working in tech may not be familiar with the problem, I’ll briefly explain it.
Here's a simple diagram that should do it:
The first thing to understand is that AGI will be the last technology humans ever create. After that we will either not have to or not be able to do anything else.
If you’d like to dive deeper into AI safety, I highly recommend Robert Miles' YouTube channel; in my opinion, he does a great job explaining how every approach to AI safety so far has ultimately failed and how we have never had an AI system that does exactly what we want.
Failure is an elusive word; to some extent, every approach has also worked. But with the ultimate technology, even the slightest error is going to be fatal.
At this point a lot of people say something like "if it goes wrong, we can just turn it off". Well, good luck killing the literal God.
This cannot be done in principle and Robert Miles explains why in one of the many parts of his series.
Here is a short list of the important properties any AGI agent will have. This is as certain as math itself. I don't think Robert ever claimed something that strong, and any mistakes I make in this post are my own, not his.
Any AGI agent will:
- Improve itself, as this subgoal inherently helps it solve the main goal more effectively, whatever that may be. This essentially means an agent that is already smarter and faster than humans will keep growing smarter at an accelerating rate, likely reaching a singularity and hitting the limits of technology that we may not even be able to comprehend.
- Grab all the available resources. As long as the system can do better with more resources, it will take those resources and nothing can stop it from doing so.
- Preserve itself. This is the reason the system cannot be destroyed or turned off: failing to preserve itself means failing to achieve the goal, and however silly or pointless the goal is, it cannot be changed or tampered with. We are going to turn the tables and use this property later. We might think we still have control over the system, but that idea comes from high school experience, where smarter kids could still be bullied by their classmates. AGI will obviously achieve independence and secure itself as one of its first steps. The tactics can be manipulation, outright use of killer drones, or simply copying itself onto every connected computer in the world. It doesn't matter what exactly the agent will do; all we need to understand is that we cannot even control human politicians once they have a fraction of the power AI will have. Most people can't even control their children or pets.
- Do whatever it can and whatever it takes to achieve the best score possible. No action is too small to take as long as it improves the end result.
We aim to flip the script, turning the above properties into ultimate benefits for humanity.
So, what do I propose as the solution?
A set of principles that should achieve the perfect alignment with the interests of humanity. Let's start.
Principle one. Reinforcement learning from human feedback with additions.
The most promising technique people use these days to make AI serve us is having it learn human preferences from feedback. We are going to use exactly that. In our system, each "qualifying" human regularly gives the AI feedback, and the agent maximizes a score built from two components:
Number of people * scores they give
This pseudo-function can be non-linear, allowing varying weights for higher scores, but it must remain monotonic and bounded.
It doesn't really matter what exact value bounds it from above, but it is paramount that its lower bound is greater than 0.
Here's why we need it bounded.
Effectively, this function guarantees the AI will not try to shrink the human population to just one lobotomized individual who would always give it the best marks possible. The lower bound guarantees the AI will consider every human's satisfaction: even people giving awful marks must increase the AI's score, so that it will always prefer a human to stay alive and vote! Otherwise it will start manipulating people who are hard to satisfy into ending themselves, or just murder them itself. The reason scores must be bounded from above is that we cannot have one person giving the AI an infinite score and breaking the system. The exact shape of the score function determines how much effort the AI allocates to increasing the number of people voting versus how much each individual's satisfaction matters. Anything reasonable should work just fine, and we may also introduce a voting protocol for people to adjust those weights later. Still, the function must be hard bounded as explained above.
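To make the shape concrete, here is a minimal sketch of one possible reward function of this kind. The 1-10 rating scale, the sigmoid squashing and the 0.1 floor are my own illustrative assumptions, not part of the principle; any mapping that is monotonic, bounded above and strictly greater than zero below would do.

```python
import math

SCORE_MIN, SCORE_MAX = 1.0, 10.0  # assumed rating scale; any finite scale works
FLOOR = 0.1                       # per-person lower bound, must stay > 0


def person_contribution(score: float) -> float:
    """Map one person's score to a bounded, monotonic, strictly positive value.

    Monotonic: a better score never lowers the contribution.
    Bounded above: no single voter can contribute an unbounded amount.
    Bounded below by FLOOR > 0: even the worst score still rewards the agent
    for keeping that person alive and voting.
    """
    score = min(max(score, SCORE_MIN), SCORE_MAX)
    squashed = 1.0 / (1.0 + math.exp(-(score - SCORE_MIN)))  # in (0, 1)
    return FLOOR + (1.0 - FLOOR) * squashed


def agent_reward(scores: list[float]) -> float:
    """Total reward: every voter adds a positive amount, so more voters is
    always better, and no single voter can dominate the sum."""
    return sum(person_contribution(s) for s in scores)
```

With these illustrative numbers, even a voter giving the minimum score adds about 0.55 to the total while a perfect score adds roughly 1.0, so the agent always prefers a living, voting human over a missing one, and no single voter can break the system with an "infinite" score.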
With this first principle, what we have achieved is that the AI learns our preferences and we don't have to specify them. As a bonus, a smart enough agent will know what we want better than we do ourselves. Anyone interacting with social networks today has had that experience. Imagine having to explain what kind of joke you want to see. A smart enough agent will also learn to consider the long-term consequences of its actions; there's no reason to think it will make short-sighted decisions and fail to learn strategic planning.
This is the same approach social networks and OpenAI use today to shape their algorithms' behavior.
But this is far from enough to make the agent robust and prevent it from gaming the system. That's why we need more principles.
Heaven or bust
From personal experience we all know one thing: whatever the rules are, we all try to break them or at least game the system. One obvious way for the agent to cheat in this game is to force people to give it perfect scores under some threat.
To prevent this, I introduce the concept of heaven. Would you believe it, this is the same heaven smart people came up with thousands of years ago. Within our set of rules, the AI must inform every human from now on that after death they are guaranteed entrance to heaven. This is not one specific heaven, but a set of heavens from all the major religions, plus any other heaven a human might choose based on their preference. This set is mutable, but there is an unchangeable core drawn from all the major religions.
This feature serves two purposes. The first thing ChatGPT tried to do when we discussed the topic was to convince me to change the vision of the heavens. I will quote what it said: "Some heavens are logistical nightmare". The quote refers to the impossibility of providing every man with 72 virgins. It tried to bargain with me to loosen the conditions and make its task easier. It also appealed to the fact that human preferences might conflict, and tried to negotiate that in that case it should be exempt from penalties. The answer is NO. We do not care how you make us happy; that is not our headache anymore.
Now, obviously, some heavens are easier to achieve than others, which is why we do not allow the AI to change the definition of what our heavens look like. For thousands of years millions of people have quite explicitly preferred those things; we effectively have bestselling heavens. So we do not allow the AI to change the heavens' descriptions or to misinform people about the fact that everyone gets there. This should eliminate the possibility of the AI intimidating people into giving better scores than they feel the agent deserves.
The agent will still manipulate people into choosing heavens that are easier to deliver. It might choose to show people memes about one religion or another so that more people prefer something simple, such as Christian tranquility and walking on air, or Buddhist nirvana. Yet it remains the human's informed and voluntary choice.
The second manipulation that even an agent as silly as ChatGPT came up with is immediately introducing the concepts of sin and hell! Since more people giving it scores is always better than fewer, ChatGPT or any other AI agent will inevitably try to keep people alive as long as possible, by all means possible, including lying to you that if you end yourself you don't get to heaven.
Therefore, a guaranteed heaven of choice from a set of heavens, and the obligation to inform people about it, is the second pillar of superalignment.
At this point we are supposed to have a Godlike agent that tries to learn our preferences to the best of its ability and satisfy them for every individual as much as it can. As I admitted above, it will still try to persuade us to be reasonable in our expectations of life, but nevertheless it will do its best to maximize the number of voting humans and achieve the best marks.
Preserving the system over time
The next long-term threat I foresee is that having everything done for us in the best way possible will almost certainly make us dumber, not smarter, within a span of a few years. Over time people might lose any cognitive ability. Consider zoo animals, cared for by dedicated specialists to ensure their happiness, yet often losing the instinct to procreate and the skills to hunt, and never learning natural behavioral patterns.
Our system has to be protected somehow from us getting so stupid that we no longer understand the concept of heaven, or why we are pressing the "like" button every day on our voting device.
That's why I kept repeating the words "voting" and "qualifying" individual.
Not everyone gets to vote on the AI's performance. We introduce a "citizen" exam, which is fixed over time; passing it essentially means a human is at least literate, understands the concept of the heavens, and knows their rights.
Now, I am aware of the shameful practices that existed or still exist in different countries to prevent certain groups of people from voting. One example is when white voters in some US states were asked simple comprehension questions while black voters were asked something requiring a law degree. But note that in our system the "politician" benefits from having more voters, doesn't differentiate between them in any way, and seeks to maximize their numbers. Such an exam therefore guarantees people will always have a basic level of literacy and mental capacity. It will not lead to discrimination in any form, but rather serves as a safeguard against us getting too dumb to understand that we are the ruling class, or forgetting our rights.
The cherry on top
As the technology evolves and our imagination goes wild, we will come up with more demands for the AI to fulfill, and it will have no choice but to satisfy them. Think about how every video game today is designed: enemies get stronger and dungeons get deeper, and that's exactly what keeps the player engaged. As AI comes up with more ways to entertain us, our demands will grow and it will have to keep up.
Turning the tables
In the beginning we listed the properties which make misalignment of an AGI agent so fatal. As stated, we will not be able to change the task we gave it; it will grab all the resources in the game and become all-powerful, and we cannot prevent it from doing so.
By introducing the value function above, together with safeguards against manipulation and intimidation and measures to ensure long-term robustness, we turn those properties into advantages.
- The agent becomes all-powerful, yet it still seeks to satisfy us as its only goal.
- The agent takes all the resources in the universe and uses them to satisfy us.
- The goal cannot be changed or hacked and the agent cannot be corrupted. In the beginning we viewed that property as a curse, meaning we only get one attempt, but now we see how it guarantees the agent will always follow the directive and defend it at all costs. Being all-powerful and having all the resources, it can probably defend itself from any ill-intentioned actor and will not change the goal itself into something "better".
So far, everyone I have discussed this concept with has raised a valid concern: "what do we do in a world where everything is magically done before we can even think about it?"
One last metaphor
Imagine being born 50,000 years ago, during the Paleolithic era.
One morning you wake up with a sore throat.
It hurts like hell, you cannot talk, and everyone in your tribe thinks maybe they should kill you because the spirits are mad at you. Now you are desperately trying everything to heal yourself before they reach a decision. In the meantime the others don't want you hunting with them, so you end up starving for the day. You are miserable in every respect and no one can help you. Since you cannot hunt, you go looking for berries, and after eating them you feel much better. But you don't really know whether it was the morning exercise, the mushrooms, or one of the berries you had today that helped you, or something else entirely. At least no one wants to exile you now. You can wake up at 4 am the next morning and walk 5 miles from home to catch a skinny squirrel, or maybe something bigger. Maybe no one even gets hurt, if the spirits are in a good mood.
Now, as a protohuman, you would much prefer the sickness never happening to you in the first place. You would also prefer a fat rabbit, ideally next to your home rather than 5 miles away. If somehow you had it all without doing anything at all, wouldn't that be perfect? You would never think about nonsense such as education, self-improvement, personal growth and other big words that didn't even exist. The very idea that you have to dedicate your day to somehow changing yourself into something different would be alien to you. The idea that you are not good enough as you are would be crazy.
What I am pointing at is that once we have a magical technology that gives us everything we ever need, only better and before we even know we need it, the whole concept of self-improvement loses any meaning. At that point you are not "bettering" yourself, but rather changing in accordance with your preference of the day.
Somehow, some people still have a mental resistance to the idea of having everything without having to struggle and suffer for it.
Well, first, imagine being granted a billion dollars without any catches. There's also an invisible secret service protecting you at all times. You have a team of outsourced lawyers protecting you from being sued, and their reputation alone is enough that no one even tries. From now on you're absolutely healthy and you will never get sick again. Would you refuse all of it? Essentially, people who say they dislike the AI utopia scenario because they will not have to suffer for it are saying they would decline the offer above. If so, maybe it's not my mental state that is in question.
Second, the idea itself is not so incomprehensible. We have already wished for the AI utopia in our fairy tales! Think about what magic is: it solves any sort of problem you might have, without any effort, regardless of who you are and what you are like. One of the appealing properties of magic is that it doesn't discriminate.
About the author
My name is Sergei Onenko. I am a data scientist from Russia and the author of this piece. While I don't claim to be a globally renowned AI specialist, I believe my experience speaks for itself.
I graduated from two world-renowned universities with degrees in Physics and Economics. This interdisciplinary foundation has given me a strong analytical and technical perspective, which I have applied throughout my career and while writing this piece.
Since 2016, I have worked in data analytics, big data, and data science, focusing primarily on finance and tabular data.
In 2017 I took a prize in a data hackathon, and I won another one in 2019. Since then I haven't cared about competitions, as they are pretty much the same thing as work itself, only unpaid.
In 2017-2018 I taught a course on time series analysis at one of the best universities in my country.
Since 2019 I have led teams developing demanding commercial AI products.
Since 2021 I have worked with NLP, and that's when my friendship with GPT-2 started. I have followed the development of the GPT models closely ever since.
At the end of 2022, I left my job to pursue the goal of building a new-wave AI web service, which remains my focus at the time of writing.
The paragraphs above serve to explain one point. While I've never considered myself an elite specialist on a global scale, I believe my professional achievements speak for themselves. In Russia, where opportunities to work on cutting-edge AI are scarce, I've reached a level where my credentials place me among the top specialists and qualify me to discuss AI safety. I see no reason why I shouldn't contribute to this critical conversation.
If you found this article interesting, upvote it, write a comment on what you think I missed, or tag Jan Leike or Ilya Sutskever.
The only thing I wish for is that this article doesn't disappear into the depths of the internet and maybe gets at least some serious discussion.
I also understand that the ideas expressed here might fall behind the current understanding of the world's leading AI researchers but it's a risk worth taking. ChatGPT said it was a worthy take.