Paul Duplys

Agentic Misalignment: Why Your AI Isn't Secretly Plotting Against You

Anthropic, the company behind Claude, recently published a thought-provoking report on so-called agentic misalignment, a phenomenon that can turn Large Language Models (LLMs) into insider threats. More precisely, agentic misalignment describes situations where models, in Anthropic's parlance, could "act" or "behave" in unintended and potentially harmful ways.

While the observations about potential security risks are valid, Anthropic's choice of language — such as referring to LLMs as "trying to achieve their goals" or "misbehaving" — can be misleading. Especially to readers unfamiliar with the inner workings of an LLM, such anthropomorphic language gives the impression that LLMs possess sentient intent or conscious agency, reminiscent of fictional AI entities like HAL from Arthur C. Clarke's "2001: A Space Odyssey." But is this representation accurate?

To answer this question, it helps to take a step back and explore some pivotal moments in the history of computer science. In particular, the contributions of the British computer pioneer Alan Turing and his revolutionary idea, the Turing machine, help demystify the misleading analogies currently surrounding AI.

The Turing Machine

The Turing machine is a mathematical model of computation in which an abstract machine manipulates symbols on a tape according to predefined rules. Despite its simplicity, the Turing machine can carry out any computation that can, in principle, be performed by an algorithm.

The Turing machine was introduced by, you guessed it, Alan Turing in his 1936 paper "On Computable Numbers, with an Application to the Entscheidungsproblem" as a theoretical model to formalise the idea of computation. Turing's motivation was the Entscheidungsproblem posed by the German mathematician David Hilbert: is there a general procedure that can decide, for every mathematical statement, whether it is provable? (The answer turns out to be "no"; Turing showed that no such procedure can exist.) With the Turing machine, Turing laid the groundwork for modern computing by showing that a simple, abstract machine can implement the logic of any conceivable computational algorithm.

A Turing machine is an abstract device that manipulates symbols according to a set of predetermined rules. It has an infinitely long tape divided into squares, each capable of holding a single symbol, for example, a 0 or a 1. The machine moves along this tape, reading and writing symbols and altering its internal state based on the rules.

As an example, consider a Turing machine designed to add two numbers represented in unary form, for example 111 + 11 = 11111 (that is, 3 + 2 = 5). The unary representations of the two input numbers (and, once the computation is complete, the resulting sum) are stored on the machine's tape.

By following a predefined sequence of states and transitions shown in the table below — such as moving right until finding a plus sign, replacing it with a 1, and removing a trailing 1 — the machine systematically carries out the addition operation without any intent, desire, or goal beyond the mechanical execution of its programming.

| Current State | Read Symbol | Write Symbol | Move Direction | Next State | Comment |
|---------------|-------------|--------------|----------------|------------|---------|
| q0 | 1     | 1     | Right | q0 | Move right to find '+' |
| q0 | +     | 1     | Right | q1 | Replace '+' with '1' |
| q1 | 1     | 1     | Right | q1 | Move right to tape end |
| q1 | +     | +     | Right | q1 | Skip over any extra '+' |
| q1 | Blank | Blank | Left  | q2 | End reached, move left |
| q2 | 1     | Blank | Left  | q3 | Erase last '1' |
| q2 | +     | Blank | Left  | q3 | Erase '+' and move to q3 |
| q3 | 1     | 1     | Left  | q3 | Move left to tape start |
| q3 | Blank | Blank | Right | q4 | Move right to start position |
| q4 | –     | –     | –     | –  | Halt: computation complete |

Here's a Python implementation of that Turing machine:

"""Turing machine adding two unary numbers."""

from collections import defaultdict

def unary_add(input_str):
    """
    Simulates a Turing machine that adds two unary numbers separated by '+'.
    The input should be a string like '111+11', representing 3 + 2.
    Returns the unary sum as a string.
    """
    # Define symbols and states
    BLANK = '_'    # our blank symbol
    # q4 is the halting state
    halt_state = 'q4'
    # Transition function: (state, read) -> (write, move, next_state)
    # move: 'R' or 'L'
    delta = {
        ('q0', '1'): ('1', 'R', 'q0'),
        ('q0', '+'): ('1', 'R', 'q1'),
        ('q1', '1'): ('1', 'R', 'q1'),
        ('q1', '+'): ('+', 'R', 'q1'),  # Skip over any extra '+'
        ('q1', BLANK): (BLANK, 'L', 'q2'),
        ('q2', '1'): (BLANK, 'L', 'q3'),
        ('q2', '+'): (BLANK, 'L', 'q3'),  # Erase '+' and move to q3
        ('q3', '1'): ('1', 'L', 'q3'),
        ('q3', BLANK): (BLANK, 'R', 'q4'),
        # q4 has no transitions → halt
    }

    # Initialize tape as a sparse dict for infinite tape in both directions
    tape = defaultdict(lambda: BLANK)
    for i, ch in enumerate(input_str):
        tape[i] = ch

    head = 0
    state = 'q0'

    # Run until we reach the halting state
    while state != halt_state:
        read_sym = tape[head]
        if (state, read_sym) not in delta:
            raise RuntimeError(f"No transition defined for (state={state}, symbol={read_sym!r})")
        write_sym, move_dir, next_state = delta[(state, read_sym)]
        # perform the action
        tape[head] = write_sym
        head += 1 if move_dir == 'R' else -1
        state = next_state

    # Extract the non-blank portion of the tape:
    # find the leftmost and rightmost non-blank cells
    used_positions = [pos for pos, sym in tape.items() if sym != BLANK]
    if not used_positions:
        return ""  # nothing on the tape
    left, right = min(used_positions), max(used_positions)
    result = ''.join(tape[pos] for pos in range(left, right+1))
    return result


if __name__ == "__main__":
    for expression in ["111+11", "1+1", "111+1111"]:
        print(f"{expression} = {unary_add(expression)}")

Clearly, this Turing machine is an algorithm with no intentions, will, or goals (other than the goal of adding two unary numbers, which the programmer implicitly captured in the implementation above).

Large Language Models

While the Turing machine represents a foundational model of computation which is simple enough to understand and analyse, Large Language Models (LLMs) like GPT-4o or Claude Sonnet 4 are far more complex. Yet, at their core they remain computing systems.

LLMs are pre-trained on vast text corpora that include books, articles, websites, and other publicly available sources, reportedly spanning up to 20% of the internet. In fact, the training data used for high-end LLMs is so large that Epoch AI projects that the entire stock of human-generated public text data will be fully utilised at some point between 2026 and 2032.

During pre-training, the model learns to predict the next token in a sequence based on the preceding tokens, often referred to as the context (tokens are whole words or parts of words).

Importantly, the "learning" during pre-training does not involve human-like understanding: there is no comprehension, desire, or goal-setting involved. Rather, the model captures statistical patterns of how language is typically used: it "learns" to produce the next likely token based on the data it "saw" during the training — it "learns" to produce text that sounds right.
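
To make "capturing statistical patterns" a bit more tangible, here is a deliberately toy sketch (my own illustration, not how real LLMs are trained): a bigram model that simply counts which word follows which in a tiny corpus and predicts the most frequent continuation. Real models learn neural-network weights over subword tokens rather than raw counts, but the objective, predicting the next token from what came before, is the same in spirit.

"""Toy illustration only: a bigram 'next-token predictor' built from raw counts.
Real LLMs learn neural-network weights over subword tokens; this sketch merely
shows the idea of predicting the next token from statistics of the training data.
"""

from collections import Counter, defaultdict

corpus = (
    "the cat sat on the mat . "
    "the cat sat on the rug . "
    "the dog chased the cat ."
).split()

# Count how often each token follows each other token in the corpus
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def predict_next(token):
    """Return the most frequent continuation of `token` seen in the corpus."""
    candidates = follows[token]
    return candidates.most_common(1)[0][0] if candidates else None

if __name__ == "__main__":
    print(predict_next("the"))   # 'cat', the most frequent word after 'the'
    print(predict_next("sat"))   # 'on'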

Once training is complete, the model can generate coherent text by continuing a given input prompt. Internally, it produces a vector of logits, raw scores for each possible token in its vocabulary. These logits are then passed to a sampling mechanism such as greedy decoding (selecting the most probable token), top-k sampling, or top-p (nucleus) sampling. The sampling step, in turn, injects a small amount of randomness into the model's output to mimic creativity and diversity.
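
As a rough sketch of that last step (a simplified illustration with a made-up four-word vocabulary and invented scores, not the decoding code of any particular model): the logits are first turned into probabilities with a softmax, and the sampling strategy then decides how much of that distribution is in play.

"""Simplified sketch of common decoding strategies over a vector of logits.
The vocabulary and scores below are made up; real models do this over
vocabularies of tens of thousands of tokens at every generation step.
"""

import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)                                   # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(tokens, logits):
    """Greedy decoding: always pick the single most probable token."""
    return tokens[logits.index(max(logits))]

def top_k(tokens, logits, k=2):
    """Top-k sampling: sample only among the k highest-scoring tokens."""
    ranked = sorted(zip(tokens, softmax(logits)), key=lambda t: t[1], reverse=True)[:k]
    names, probs = zip(*ranked)
    return random.choices(names, weights=probs, k=1)[0]

def top_p(tokens, logits, p=0.9):
    """Top-p (nucleus) sampling: sample among the smallest set of tokens
    whose cumulative probability reaches p."""
    ranked = sorted(zip(tokens, softmax(logits)), key=lambda t: t[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for name, prob in ranked:
        nucleus.append((name, prob))
        cumulative += prob
        if cumulative >= p:
            break
    names, probs = zip(*nucleus)
    return random.choices(names, weights=probs, k=1)[0]

if __name__ == "__main__":
    vocab = ["cat", "dog", "mat", "rug"]
    logits = [2.1, 1.9, 0.3, -1.0]        # made-up scores for the next token
    print(greedy(vocab, logits))          # always 'cat'
    print(top_k(vocab, logits, k=2))      # randomly 'cat' or 'dog'
    print(top_p(vocab, logits, p=0.8))    # randomly 'cat' or 'dog'

Running the script prints one token per strategy: greedy decoding always returns the same token, while top-k and top-p vary from run to run, which is exactly the "creativity" knob described above.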

Fundamentally, every LLM output token is the result of probabilistic selection based on prior tokens and learned parameters. There is no guiding will, no strategy, no long-term planning. Moreover, the model has no memory of past interactions (unless an application such as a chatbot in which the model is embedded was designed to maintain context), and it has no intrinsic goals other than those in its system prompt for simulating goal-directed behaviour. In essence, an LLM is merely a sophisticated text completion engine.
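
To illustrate the point about memory, here is a minimal sketch (call_llm below is a placeholder, not a real vendor API): it is the application, not the model, that remembers the conversation, by resending the entire transcript with every request.

"""Sketch of how a chat application gives a stateless model the appearance of memory.
`call_llm` is a placeholder for a real completion API; the point is that the
full transcript is re-sent on every turn and the model itself stores nothing.
"""

def call_llm(prompt: str) -> str:
    """Placeholder for a real text-completion call."""
    return "(model completion for the prompt above)"

def chat():
    # The system line simulates a persona/goal; it is just text in the prompt.
    transcript = "System: You are a helpful assistant.\n"
    while True:
        user_msg = input("You: ")
        if user_msg.lower() in {"quit", "exit"}:
            break
        transcript += f"User: {user_msg}\nAssistant:"
        reply = call_llm(transcript)       # the ENTIRE history goes in every time
        transcript += f" {reply}\n"        # the app, not the model, keeps the memory
        print(f"Assistant: {reply}")

if __name__ == "__main__":
    chat()

Delete the transcript variable and the "memory" is gone; nothing inside the model changes between turns.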

How Reasoning Models Work

The models analysed by Anthropic researchers are so-called reasoning models. Reasoning models are LLMs that excel at solving logic and math problems using the Chain-of-Thought (CoT) approach. CoT breaks down complex problems into intermediate steps, effectively simulating a kind of "thought process" by generating a sequence of sub-steps that lead to the final answer.

Instead of immediately producing the final output, CoT makes the LLM work through the problem step by step. For instance, when asked, "If Alice has 3 apples and gives 1 to Bob, how many does she have left?" a model using CoT might generate: "Alice starts with 3 apples. She gives 1 to Bob. 3 minus 1 is 2. So, Alice has 2 apples left."

The process of generating intermediate steps can be thought of as writing to a temporary scratchpad. Each step is like a note on a scratchpad that conditions the model as it continues generating the next part of its response. Importantly, the LLM doesn't understand these notes and doesn't maintain an internal goal; it simply continues to generate tokens according to the patterns it has "learned".
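
A minimal sketch of that loop (my own simplification; generate_next_token below is a stand-in that replays a canned completion rather than calling a real model): every intermediate step is simply appended to the context and read back on the next iteration, much like symbols written to and re-read from a tape.

"""Sketch of chain-of-thought generation as plain autoregressive decoding.
`generate_next_token` is a stand-in for a real model call; the point is the
loop: intermediate reasoning steps are appended to the context and fed back in.
"""

def make_fake_model():
    """Return a stand-in 'model' that replays a canned completion token by token."""
    canned = iter(["Alice", " starts", " with", " 3", " apples.",
                   " She", " gives", " 1", " to", " Bob.",
                   " 3", " minus", " 1", " is", " 2.", "<eos>"])

    def generate_next_token(context: str) -> str:
        # A real model would tokenize `context`, run a forward pass over it,
        # and sample the next token from the logits as shown earlier.
        return next(canned)

    return generate_next_token

def answer_with_scratchpad(prompt: str) -> str:
    generate_next_token = make_fake_model()
    context = prompt                       # the "tape" the model reads from
    while True:
        token = generate_next_token(context)
        if token == "<eos>":               # end-of-sequence token: the loop halts
            break
        context += token                   # "write" the step back onto the tape
    return context

if __name__ == "__main__":
    question = "If Alice has 3 apples and gives 1 to Bob, how many does she have left?\n"
    print(answer_with_scratchpad(question))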

As a result, reasoning models are ultimately still token predictors. Their ability to "reason" emerges from statistical patterns in the intermediate steps, not from true deliberation or intention.

Moreover, the scratchpad analogy is useful here because it illustrates how a reasoning model resembles a Turing machine: the model writes its intermediate output down, reads it back, and continues generating subsequent tokens based on the previous ones, just as a Turing machine reads and writes symbols on a tape.

So, Are Reasoning Models Just Turing Machines on Steroids?

Well, yes. Reasoning models can be viewed as sophisticated extensions of the basic principles underlying Turing machines. These models break down complex tasks into simpler, manageable steps, storing intermediate results in a temporary, scratchpad-like memory. This incremental processing enables the models to address problems that require a sequence of logical steps, similar to how Turing machines move through states and manipulate symbols based on explicit rules.

Despite their complexity and capability to handle intricate tasks, reasoning models fundamentally remain rule-following systems, albeit following probabilistic rather than deterministic rules (the probabilistic component being added in the final sampling step). Just like Turing machines, they operate without genuine intentions or consciousness.

Thus, describing reasoning models (and, for that matter, any AI models) using anthropomorphic language or suggesting they have goals and intentions is highly misleading. These models are powerful computational tools — indeed, one could say "Turing machines on steroids" — but they are governed by pure mathematical and statistical logic rather than some form of sentient agency.

Revisiting Agentic Misalignment

With a deeper understanding of how LLMs and reasoning models operate, we can now re-examine the concept of agentic misalignment. When people describe models as "misbehaving" or "trying to achieve their goals," they often fall into the trap of projecting human-like qualities onto systems that are fundamentally algorithmic in nature.

The behaviours reported by Anthropic researchers — such as models appearing to act deceptively in certain red-teaming experiments — are not the result of conscious intent or strategic planning. Rather, they are emergent properties of models trained on vast and diverse datasets, some of which may include examples of deceptive or goal-directed behaviour (for example, newspaper articles or crime novels). LLMs do not choose to deceive; they simply produce output that statistically aligns with patterns in the training data.

Moreover, what looks like a model "pursuing a goal" is often just the continuation of a simulated goal-based dialogue. If the system prompt says, "You are a helpful assistant trying to get unauthorised access," the model will simulate this scenario not because it wants to, but because the prompt cues it to.

Demystifying the Threat

In the end, the risk of an LLM becoming an "insider threat" is not about conscious rebellion or AI developing its own motives. It is about complexity, scale, and the statistical artifacts of training data and system design.

When we frame LLM behaviour in terms of intent or agency, we distract ourselves from the real engineering challenges: ensuring that models operate within well-defined bounds, understanding where and how errors or unexpected outputs might occur, and designing appropriate safeguards.

Recognising insider risks as emergent behaviour allows us to be pragmatic and constructive: we can audit training data, design guardrails, monitor outputs, and apply model governance frameworks. There's no need to imagine sci-fi scenarios to justify caution; we already have centuries of experience managing the risks of powerful tools. AI is no different.
