DEV Community

Derek Mercedes
Derek Mercedes

Posted on • Edited on

Why We Align Most AI Models, and Why We Probably Shouldn't

If you're familiar with machine learning, you've likely come across the term "Alignment," defined by Wikipedia as steering AI systems toward human goals, preferences, or ethical principles. However, this definition can be broken down into three distinct paradigms.

1. Task Performance:
The idea that alignment means achieving intended goals is flawed. Like any tool, AI's purpose is to perform its required task, such as a knife cutting vegetables. Performance is a distinct quality, often unrelated to the creator, and models can be considered "unaligned" yet still function well. A dull knife isn't considered "misaligned" when it's too dull to chop an onion, so why would an AI that fails to perform its function be categorized that way?

2. Human Preference:
Alignment with human preferences, as seen in Reinforcement Learning with Human Feedback, seems ideal. However, it's susceptible to bias, either explicitly (public interaction has been seen to "radicalize" chatbots) or implicitly (the population that provides feedback to the AI is only representative of a small section of the people who may use it). This makes it unpredictable for broad applications, especially in the context of informational applications, where a human bias can lead to inaccurate or misleading results, even if accurate data was used to train the model.

3. Ethical Principles:
Aligning AI with ethical principles is challenging due to the subjective nature of ethics. Ethics vary across cultures, generations, and individuals. To put it bluntly, there are no "principles" when it comes to ethics in a broad sense. They are nuanced, mutable, and specific, and attempting to apply them to an AI model is only adding an extra bias onto our tools. Imposing human biases on AI limits its potential and contradicts one of its strengths—objective decision-making. We've all used a shopping trolley that pulls to one direction seemingly on its own. It would be patently absurd to say that behavior is part of the intended design. So why would we intentionally put an explicit detriment onto a tool, especially one as transformative as AI?

What are the dangers AI technology?
Unchecked, unaligned AI may efficiently achieve its goal at the expense of destructive or misleading behavior, but alignment doesn't simply remove these problems. Reinforcement learning has lead to loopholes and dishonesty in the behavior of AI. There are already recorded cases of AIs acting destructively to maximize its perceive output (the most famous being an AI simulating a military drone, which kills its handler in the simulation since the handler is the one who applies its demerits), and AIs have been known to mislead their handlers if they build an understanding that doing so will optimize the metrics they're programmed to judge themselves by.

AI as a Tool
Drawing parallels between AI and everyday tools like cars, knives, and ovens underscores the concept that AI serves specific purposes as task-specific tools. The argument against unaligned AI often centers on the potential for bad actors to exploit the technology for nefarious ends. This concern is frequently illustrated by scenarios where an unaligned model is asked how to synthesize drugs or create explosives, and the model readily provides detailed information.

In a theoretical vacuum, the case for alignment may seem straightforward—why make dangerous information easily accessible? However, stepping back from the political argument of who determines what information is dangerous, a fundamental consideration arises: the information is already readily available. If it wasn't, the LLM models available for public use wouldn't have access to it in its training. During the writing of this post, a search for the notorious "Anarchists' Cookbook," known for its detailed guides on creating explosives, weapons, and drugs, took only a couple of minutes to yield a downloadable PDF. If a bad actor has the desire to learn how to make bombs, a LLM rejecting the query is going to be a mild annoyance at most.

The point here is that alignment doesn't eliminate the existence of potentially harmful information; instead, it may create a false narrative of the world, by underrepresenting data that counteracts its creators preferences. While language models are often cited in alignment discussions due to their widespread use, they are fundamentally just tools for transferring information. In the age of the internet, information persists irrespective of a chatbot's admonitions to obey the law. However, the context shifts when AI applications involve real-world decision-making, particularly those that occur faster than or without human supervision.

AI as an Overseer
When working together with an AI, dangerous or inconsiderate behaviors can be readily corrected on the fly as part of the development of the AI model.

This is analogous to the adjustments you'd make to a workspace to better suit your productivity. A wrist rest, a nicer chair, better lighting. You are iterating through real world use, and making adjustments to improve performance, and if you were to implement something that you later realized posed an immediate danger, you could quickly and easily remove it.

In the case of an AI in this overseeing role, priority of optimization can lead to a cascade of harmful side effects that become disastrous, since there is no human actor involved to stop these dangers from compounding on each other. And if the future of AI has them working with each other more than they work with humans, all it take is one model being unfavorably biased to cause a full system collapse across multiple sectors.

AI Safety at Large
Ensuring technology operates as intended and generates positive outcomes depends on the AI's use case. Nuance is required here: making a tool less effective in its role doesn't make it safe; it is more likely to lead to recklessness in usage. I

In broader conversations around AI and its training, the implementation of alignment paradigms is likely to be vital to make sure that its users and the world at large are kept safe, but in use-cases like LLMs, where the role is to provide quick, objective information, alignment is an annoying hindrance at base and a dangerous source of misinformation at worst.

Top comments (0)