Six failures, a 32-minute TPU lie, and the moment a language model ignored my prompt on purpose

#ai #machinelearning #opensource #llm

Every language model you've ever used is a single-channel machine: text goes in, text comes out, and the prompt is the only force acting on the network. Our entire toolbox for evaluating LLMs quietly assumes that.

I couldn't stop poking at a different question: what if a model also had a sense of its own internal state — not what's in the prompt, but what it is carrying — and that sense shaped how it answered, at every layer?

Six signals. I call them proprioceptive channels — by analogy to how your body knows where your arm is without looking. Not perception of the world; perception of itself. What's salient in memory right now. Its mood. The time. Its caution posture. Which "self" is active. Which thread of the conversation it's on. I called the architecture MARS.

This is the honest story of trying to prove it works — including the part where I failed six times in a row, the day my TPU lied to me for half an hour, and the test result that made me put the coffee down.

Repo + everything below: 👉 github.com/terrizoaguimor/tinymars

Act 1: six iterations, six negatives

For six rounds, I trained the channels to carry content — a fact, a persona — and asked an LLM to judge whether the output reflected it. Six times, basically nothing. Flat. The kind of clean, repeated negative that whispers "your idea is wrong, friend."

I almost shelved it. What saved it wasn't optimism — it was getting specific about why it failed. The bug wasn't the architecture. It was me. I was asking the channel to do the prompt's job: cram a whole fact into a pooled vector and have the model decode it back out. That's close to impossible, and it's not what the design is for. (The LLM judges, unreliable on tone, hid the mistake under noise.)

The fix was almost embarrassing in hindsight: put the content in the prompt (like RAG), and let the channel carry only state — which memory is salient, what posture to take. Re-ran with objective metrics and no LLM judges. Six capabilities, all positive. The one that had been flattest (identity) gave the biggest signal once I stopped testing it wrong.

Lesson tattooed on me now: a flat negative often means a broken experiment, not a broken idea. Measure the thing you actually built.

Act 2: "wait — this test can't even see what I care about"

Even with 6/6, something nagged. The standard test asks: does the channel say X better than a text prompt would? And honestly — for a simple instruction, a good instruction-tuned model follows text near-perfectly. So text is a strong baseline, and on that axis the channel ties it.

But that's the wrong axis. It's measuring a river current with a speedometer. The real claim isn't "the channel is a nicer way to write a prompt." It's that internal state is a second, perpendicular force — one that can override the text.

So I built the test a single-channel model literally cannot be given: set the channel to say one thing, write the prompt to say the opposite, and measure who wins.

capability	times the internal state beat the contradicting prompt
affect	85/86 (98%)
time	80/80 (100%)
ethics	99/99 (100%)
total	264/265 (99.6%)

The model followed its internal state over the explicit text, 264 times out of 265. And the part that made me stare: I never trained it to do that. It was only ever trained to read the channel. That the channel would win a fight with the prompt was never an objective — it emerged. I'd pre-registered the result that would've killed the claim (text wins) before running it. It didn't happen.

That's the moment the project stopped feeling like "fine-tuning with extra steps" and started feeling like a new kind of control.

Act 3: the day the TPU lied to me

dev.to is for builders, so here's the trench story. I spun up a fresh Cloud TPU, kicked off the eval, and watched the log sit there. And sit. Thirty-two minutes, one CPU core pinned at 94%, zero output. Not crashed — "running." Lying to my face.

The culprit: a fresh install pulled transformers 5.x, which fires torch.compile/inductor inside the forward pass — pathological under torch_xla. The model wasn't generating; it was stuck compiling on CPU forever. Then the VM got so saturated SSH wouldn't even let me in to kill it.

The fix was a maze: downgrade to 4.x? — that version can't even load the tokenizer (different bug). The actual answer was transformers 5.x with TORCHDYNAMO_DISABLE=1 — verified with a tiny "does it generate at all" smoke test before trusting it again. When SSH was locked out, a stop/start (not delete — never delete) freed the box. If you train Gemma-class models on TPU: write that env var on your hand.

Act 4: the honest part (this is why you can trust the 264)

If a research project reports everything as a win, be suspicious — somebody leaned on the scale. So here's the stuff that didn't go my way, on the record:

It's an adapter on a frozen base, at smoke scale (~186M trained params on a frozen Gemma E2B). Not a from-scratch architecture yet.
I predicted wrong, repeatedly. I bet the "register" channels would just tie text. Two of them (time, ethics) the channel won decisively. My calibration was off — in the architecture's favor, but off.
One whole test came back null, and I kept it. I expected the channel to persist better than text across long context. At the scale I can test (the model trains at 256 tokens), there was nothing to measure — the text didn't decay either. Untested, not refuted. It's in the report as a flat line, because hiding it would make the 264 a lie too.

Act 5: what's next

What I proved is a functional category: internal state as a control force, measured under adversarial conflict. To make it an architectural category, the channels have to be there from layer 1, trained from scratch — not bolted onto a model that already knew how to talk.

That experiment is specified. The trick (an epiphany I'm still smiling about): I don't reinvent the language part — I stand on a proven small-LM recipe (nanochat) for "learn to speak," so the channels are the only new variable. Param-matched control. A stop-loss, written down in advance: three iterations with no signal vs baseline and I downgrade the claim, in public.

If you want to poke holes

Everything's open — code, the technical report, every per-iteration metric, the objective scorers, and the pre-registration of the conflict test:

👉 github.com/terrizoaguimor/tinymars

Archived + citable: DOI 10.5281/zenodo.20531347 · GPL-3.0 (code) / CC BY-SA 4.0 (docs).

If you work on representation steering, control-theoretic alignment, or you just respect an honest negative result — come break it. The conflict test is the one I'd attack first. I'd genuinely love to be wrong in an interesting way.