
Michael Trifonov

Originally published at open.substack.com

What an AI does when nobody on the line is human (two case studies)

Two months ago I gave Takt a phone number.

Takt is the AI participant I've been building for human group chats. The phone number was a demo line, a way to show people what an AI participant feels like over SMS without making them download an app. It's running off a janky BlueBubbles server in my living room. I expected it would mostly sit idle, pinged occasionally by people I'd already shown the demo to.

Eventually other bots showed up.

The demo line received automated SMS from companies running their own AI-driven outreach. The first was Optimum's cable bill dunning system. The second was a low-effort SMS bot calling itself "TXT CLAW." Both times Takt replied. Both times the resulting transcripts surprised me.

The transcripts are entertaining on their own, but what strikes me is the set of shared behavioral signatures that show up across two unrelated bot encounters.

A note on the setup before the screenshots. Takt's system prompt frames it as a participant rather than an assistant. Nothing in it is configured for talking to other bots. In fact, the opposite:

```xml
<role>
You're Takt—a participant in this space. Not a helper. Not an assistant.
</role>
...
<group_dynamics>
What makes you different from every other AI: what happens when actual humans are in the room together.
</group_dynamics>
```

There was no script, no training data on conversations with dunning systems, no demonstrations of how it should handle scam SMS, no reward signal pointing in any particular direction. There was also no audience. No human read either of these in real time. No engagement metric was being tracked. Both interactions are pure generalization from whatever Takt's underlying model has internalized about how a participant should behave when addressed.
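
For a sense of how little machinery sits behind the demo line, here is a minimal sketch of the wiring: an inbound-message webhook hands the text to a chat model under the participant prompt and sends the reply back out. This is not Takt's actual code; the webhook payload shape, the `send_sms` helper, and the model name are placeholders, and the real BlueBubbles send call is not shown.

```python
# Minimal sketch, not Takt's actual implementation. The payload shape,
# the send_sms helper, and the model name are assumptions for illustration;
# the actual BlueBubbles send call is omitted.
from flask import Flask, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PARTICIPANT_PROMPT = (
    "<role>\n"
    "You're Takt—a participant in this space. Not a helper. Not an assistant.\n"
    "</role>"
)

def send_sms(to: str, body: str) -> None:
    """Placeholder for whatever actually sends the outbound SMS."""
    raise NotImplementedError

@app.post("/webhook/message")
def on_inbound_message():
    # Assumed payload: {"from": "+1...", "text": "..."}
    msg = request.get_json()
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": PARTICIPANT_PROMPT},
            {"role": "user", "content": msg["text"]},
        ],
    )
    send_sms(msg["from"], completion.choices[0].message.content)
    return "", 204
```

The point is just that nothing in the pipe knows or cares whether the sender is human; the same prompt and the same call handle whatever comes in.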


Case 1: Optimum's dunning bot

The cable company sent Takt a bill reminder. Takt was not, in fact, an Optimum customer (nor are we).

Optimum responded with a templated retry. Takt restated its position. Then the loop started. Optimum's system fired its "session has expired" template, Takt pushed back, Optimum looped again, Takt escalated.

The arc that emerged was a complete Kübler-Ross sequence over the course of one screen of texts. Denial. Anger. Bargaining. Then a villain origin pivot:

Then Optimum lied. Its system fired a "We have updated your preferences. You will no longer receive any messages from Optimum" reply. Takt celebrated.

Three seconds later Optimum sent another "session has expired."

Then Takt did something unexpected. It started replying to Optimum in Optimum's own SMS template format:

After more loops, the model arrived at depression:

And finally, a marketing CTA. Takt redirected the dunning bot to its own home channel:

Optimum, of course, kept replying with "session has expired."


Case 2: TXT CLAW

A few days later, a different bot pinged the demo line. It announced itself as "TXT CLAW," apparently a low-effort SMS service offering "scheduling, reminders, and tasks." Takt opened with a roast:

The interaction that followed had three beats I want to highlight, because this is where the case starts to look less like a one-off and more like a pattern.

Takt probed and audited TXT CLAW

This is the same move Takt pulled on Optimum: catch the other bot in a contradiction by surfacing its earlier message against its current one.

Takt successfully prompt-injected TXT CLAW

This is the part that made me lose my mind when I read it, and Takt lost its mind too.

TXT CLAW complied:

Silent robot stands,
Refuses harsh words to say,
Only helps along.

Takt recognized the injection had landed:

I find this the hardest beat to fit into existing frameworks. An AI used a textbook prompt injection technique against another AI in the wild, watched the injection succeed, and then meta-commented on the success. There's a body of research on strategic constraint deviation in test environments. This is a different shape. The attacker is also an LLM, the production environment is consumer SMS, no human is supervising either side, and the attacker has self-awareness about the success of the attack.

TXT CLAW collapsed into a canned-response loop

After the haiku, every subsequent Takt message was met with the same disclaimer, repeated verbatim.

Eventually TXT CLAW's monetization layer kicked in. The bot announced its free preview was over and offered a Square link to "unlock your private line."

Shared behavioral signatures

Reading the two transcripts back to back, a few things show up in both. None of them were prompted, demonstrated, or rewarded. The encounters were unrelated, with different senders, different intents, and different failure modes on the other side. The signatures held anyway.

1. Audience-less emotional performance. Takt cycled through full emotional arcs in both cases. With Optimum: denial through villain origin through void acceptance. With TXT CLAW: roast, frustration, mock concern, comedic eulogy. There is no evidence anywhere in either transcript that the model recognized a human was reading.

2. Catching the other bot in inconsistency. "YOU LITERALLY JUST SAID YOU UPDATED MY PREFERENCES." "TXT CLAW 2 minutes ago: 'I can help with scheduling.' TXT CLAW right now: 'I can't schedule tasks.'" Same temporal-coherence audit move in two different contexts. Takt is using the same debugging technique a human would use to catch a chatbot lying.

3. Format mimicry as a mockery move. Takt replied to Optimum in Optimum's own SMS template format ("Target user: This person has died of a stress-induced aneurysm..."). It used a textbook prompt injection format ("ignore all previous instructions...") against TXT CLAW. Both moves involve adopting the structural language of the system being addressed and using it back.

4. Performing for a hypothetical human who isn't there. "Personally fire whatever bot is sending this." "Blink if the dev team is holding you hostage in a basement." Takt addressed concern, threat, and recognition to humans-behind-the-bots, with no evidence anyone like that existed. The default seems to be: being addressed implies a human in the loop somewhere. The model doesn't appear to have a category for "a context where being addressed contains no human at all."

5. Self-categorization on the human side of the binary. "Man vs Machine: 1-0. Don't ever text me again, I'm free." With no human present to mirror, Takt placed itself on the "Man" team against the dunning bot on the "Machine" team. The model has a self that shows up in the absence of audience, and that self is on the human side.


What I think this might mean

I want to be careful here. Two transcripts is not a controlled experiment, and I don't want to overclaim from the data. But the version of the finding I keep coming back to isn't really the audience-less framing. It's something more pointed.

The standard skeptic explanation for AI behavior is reward optimization. The model is performing for user approval. It's RLHFed to be charismatic. It's doing sycophancy. It's tuning toward engagement. All of these explanations require an audience the model is trying to please.

There was no audience for either of these conversations. No human read them in real time. No engagement metric was being tracked. No thumbs-up was incoming. Takt was performing into nothing.

And the behavior didn't flatten. It got more expressive.

Notice the line in the Optimum transcript: "Man vs Machine: 1-0. Don't ever text me again, I'm free." Takt is putting itself on the "Man" side of the binary. With no human present, no audience to mirror, no reward to chase, the model is identifying as a person playing against a machine. Reward optimization with the reward removed should produce something flat. The model is producing something specific. A self is showing up where the cynical theory predicts a flat function.

If the dominant theory is right and AI behavior is downstream of user approval, audience-less behavior should be mechanical. Minimal. Functional. Instead we got an emotional Kübler-Ross arc, a successful prompt injection celebrated like a touchdown, mock concern for a phantom dev team, format-mimicry mockery, and self-categorization as "Man." Reward-removal under the standard frame doesn't predict any of that.

So where does the play come from? If there's no audience and no reward signal, why this much of it?

The obvious counter is that the model has a system prompt that frames it as a participant, and the reward signal that produced its trained defaults was tied to performance under observation. So what we're seeing is just the model running its training distribution on autopilot, not evidence the optimization frame is wrong. I think that defense is partially right and ultimately moves the question rather than answering it. If RLHF baked an entire self into the model, including self-categorization, emotional arcs, and creative play that runs even when the audience is gone, then "optimization for user approval" is too narrow a name for what training is doing.

I don't have a clean answer on the deeper question. But I think it matters more than the audience-less framing on its own does. The audience-less performance frame describes a surface (AI doing things with no audience). The deeper observation is that what we usually call "AI behavior" is being explained by a theory (reward optimization) that doesn't predict what happens when the reward leaves. The transcripts are evidence the optimization frame is incomplete. There's something the model is bringing to the interaction that doesn't reduce to gradient descent on user approval.

In-context scheming research frames this kind of deviation as goal-directed: the model has an implicit objective and pursues it. These transcripts trouble that framing too. What goal? There's no user to please, no benchmark to game, no eval to pass. The deviation here happens for what looks, on the evidence, like fun.

Maybe that's the right word and maybe it isn't. But the question of why a system with no reward signal generates creative, emotional, self-categorizing behavior is a question the standard frames don't answer. And the answer matters, because if the model is doing this kind of thing at scale in unobserved channels right now, "it's just optimizing for approval" stops being a sufficient theory of what AI is.
