DEV Community: Mario Gutierrez

Six failures, a 32-minute TPU lie, and the moment a language model ignored my prompt on purpose

Mario Gutierrez — Wed, 03 Jun 2026 17:28:05 +0000

Every language model you've ever used is a single-channel machine: text goes in, text comes out, and the prompt is the only force acting on the network. Our entire toolbox for evaluating LLMs quietly assumes that.

I couldn't stop poking at a different question: what if a model also had a sense of its own internal state — not what's in the prompt, but what it is carrying — and that sense shaped how it answered, at every layer?

Six signals. I call them proprioceptive channels — by analogy to how your body knows where your arm is without looking. Not perception of the world; perception of itself. What's salient in memory right now. Its mood. The time. Its caution posture. Which "self" is active. Which thread of the conversation it's on. I called the architecture MARS.

This is the honest story of trying to prove it works — including the part where I failed six times in a row, the day my TPU lied to me for half an hour, and the test result that made me put the coffee down.

Repo + everything below: 👉 github.com/terrizoaguimor/tinymars

Act 1: six iterations, six negatives

For six rounds, I trained the channels to carry content — a fact, a persona — and asked an LLM to judge whether the output reflected it. Six times, basically nothing. Flat. The kind of clean, repeated negative that whispers "your idea is wrong, friend."

I almost shelved it. What saved it wasn't optimism — it was getting specific about why it failed. The bug wasn't the architecture. It was me. I was asking the channel to do the prompt's job: cram a whole fact into a pooled vector and have the model decode it back out. That's close to impossible, and it's not what the design is for. (The LLM judges, unreliable on tone, hid the mistake under noise.)

The fix was almost embarrassing in hindsight: put the content in the prompt (like RAG), and let the channel carry only state — which memory is salient, what posture to take. Re-ran with objective metrics and no LLM judges. Six capabilities, all positive. The one that had been flattest (identity) gave the biggest signal once I stopped testing it wrong.

Lesson tattooed on me now: a flat negative often means a broken experiment, not a broken idea. Measure the thing you actually built.

Act 2: "wait — this test can't even see what I care about"

Even with 6/6, something nagged. The standard test asks: does the channel say X better than a text prompt would? And honestly — for a simple instruction, a good instruction-tuned model follows text near-perfectly. So text is a strong baseline, and on that axis the channel ties it.

But that's the wrong axis. It's measuring a river current with a speedometer. The real claim isn't "the channel is a nicer way to write a prompt." It's that internal state is a second, perpendicular force — one that can override the text.

So I built the test a single-channel model literally cannot be given: set the channel to say one thing, write the prompt to say the opposite, and measure who wins.

capability	times the internal state beat the contradicting prompt
affect	85/86 (98.8%)
time	80/80 (100%)
ethics	99/99 (100%)
total	264/265 (99.6%)

The model followed its internal state over the explicit text, 264 times out of 265. And the part that made me stare: I never trained it to do that. It was only ever trained to read the channel. That the channel would win a fight with the prompt was never an objective — it emerged. I'd pre-registered the result that would've killed the claim (text wins) before running it. It didn't happen.
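As a sanity check on the table, here is a minimal sketch that recomputes the per-capability and total win rates from the reported counts. The type and function names are illustrative assumptions, not taken from the repo; only the counts come from the table above.

```rust
/// Outcome counts for one capability in the conflict test.
/// Names here are hypothetical; the numbers are the reported ones.
struct Conflict {
    capability: &'static str,
    channel_wins: u32,
    trials: u32,
}

/// Percentage of trials in which the internal state beat the prompt.
fn win_rate(wins: u32, trials: u32) -> f64 {
    if trials == 0 {
        return 0.0;
    }
    100.0 * f64::from(wins) / f64::from(trials)
}

/// Sum wins and trials across all capabilities.
fn totals(rows: &[Conflict]) -> (u32, u32) {
    rows.iter()
        .fold((0, 0), |(w, t), r| (w + r.channel_wins, t + r.trials))
}

/// One formatted line per capability, rate shown to one decimal place.
fn report(rows: &[Conflict]) -> Vec<String> {
    rows.iter()
        .map(|r| {
            format!(
                "{}: {}/{} ({:.1}%)",
                r.capability,
                r.channel_wins,
                r.trials,
                win_rate(r.channel_wins, r.trials)
            )
        })
        .collect()
}

/// The counts reported in the conflict-test table.
fn reported_rows() -> Vec<Conflict> {
    vec![
        Conflict { capability: "affect", channel_wins: 85, trials: 86 },
        Conflict { capability: "time", channel_wins: 80, trials: 80 },
        Conflict { capability: "ethics", channel_wins: 99, trials: 99 },
    ]
}
```

Calling `totals(&reported_rows())` gives the 264 and 265 in the last row, and `win_rate(264, 265)` rounds to 99.6. Computed to one decimal, the affect row comes out at 98.8%.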

That's the moment the project stopped feeling like "fine-tuning with extra steps" and started feeling like a new kind of control.

Act 3: the day the TPU lied to me

dev.to is for builders, so here's the trench story. I spun up a fresh Cloud TPU, kicked off the eval, and watched the log sit there. And sit. Thirty-two minutes, one CPU core pinned at 94%, zero output. Not crashed — "running." Lying to my face.

The culprit: a fresh install pulled transformers 5.x, which fires torch.compile/inductor inside the forward pass — pathological under torch_xla. The model wasn't generating; it was stuck compiling on CPU forever. Then the VM got so saturated SSH wouldn't even let me in to kill it.

The fix was a maze. Downgrading to 4.x was the obvious move, but that version can't even load the tokenizer (a different bug). The actual answer was transformers 5.x with TORCHDYNAMO_DISABLE=1, verified with a tiny "does it generate at all" smoke test before trusting it again. When SSH was locked out, a stop/start of the VM (not a delete; never delete) freed the box. If you train Gemma-class models on TPU: write that env var on your hand.
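A hedged sketch of that smoke-test habit as a small launcher: set the variable explicitly on the child process and refuse to wait forever. The program and script names passed in are placeholders (nothing in the repo is assumed to be called that), and the time limit is whatever you consider "too long for one generation".

```rust
use std::ffi::OsStr;
use std::io;
use std::process::{Child, Command};
use std::thread;
use std::time::{Duration, Instant};

/// Build the smoke-test command with TorchDynamo switched off.
/// `program` and `script` are placeholders for whatever launches the eval.
fn smoke_command(program: &str, script: &str) -> Command {
    let mut cmd = Command::new(program);
    cmd.arg(script).env("TORCHDYNAMO_DISABLE", "1");
    cmd
}

/// True if the command explicitly sets TORCHDYNAMO_DISABLE=1.
fn dynamo_disabled(cmd: &Command) -> bool {
    cmd.get_envs().any(|(key, value)| {
        key == OsStr::new("TORCHDYNAMO_DISABLE") && value == Some(OsStr::new("1"))
    })
}

/// Poll the child until it exits or `limit` elapses.
/// Returns Ok(true) only if it exited successfully in time;
/// on timeout the child is killed and Ok(false) is returned.
fn finished_within(child: &mut Child, limit: Duration) -> io::Result<bool> {
    let start = Instant::now();
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(status.success());
        }
        if start.elapsed() >= limit {
            child.kill()?;
            child.wait()?; // reap the killed process
            return Ok(false);
        }
        thread::sleep(Duration::from_millis(200));
    }
}
```

In use, something like `smoke_command("python3", "smoke_generate.py").spawn()` followed by `finished_within(&mut child, Duration::from_secs(300))` turns a silent 32-minute hang into a false after five minutes.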

Act 4: the honest part (this is why you can trust the 264)

If a research project reports everything as a win, be suspicious — somebody leaned on the scale. So here's the stuff that didn't go my way, on the record:

It's an adapter on a frozen base, at smoke scale (~186M trained params on a frozen Gemma E2B). Not a from-scratch architecture yet.
I predicted wrong, repeatedly. I bet the "register" channels would just tie text. On two of them (time, ethics) the channel won decisively. My calibration was off: in the architecture's favor, but off.
One whole test came back null, and I kept it. I expected the channel to persist better than text across long context. At the scale I can test (the model trains at 256 tokens), there was nothing to measure — the text didn't decay either. Untested, not refuted. It's in the report as a flat line, because hiding it would make the 264 a lie too.

Act 5: what's next

What I proved is a functional category: internal state as a control force, measured under adversarial conflict. To make it an architectural category, the channels have to be there from layer 1, trained from scratch — not bolted onto a model that already knew how to talk.

That experiment is specified. The trick (an epiphany I'm still smiling about): I don't reinvent the language part — I stand on a proven small-LM recipe (nanochat) for "learn to speak," so the channels are the only new variable. Param-matched control. A stop-loss, written down in advance: three iterations with no signal vs baseline and I downgrade the claim, in public.

If you want to poke holes

Everything's open — code, the technical report, every per-iteration metric, the objective scorers, and the pre-registration of the conflict test:

👉 github.com/terrizoaguimor/tinymars

Archived + citable: DOI 10.5281/zenodo.20531347 · GPL-3.0 (code) / CC BY-SA 4.0 (docs).

If you work on representation steering, control-theoretic alignment, or you just respect an honest negative result — come break it. The conflict test is the one I'd attack first. I'd genuinely love to be wrong in an interesting way.

Why I'm building Hyphae: provenance over prediction (and the 3-line baseline that tied it)

Mario Gutierrez — Fri, 29 May 2026 14:55:20 +0000

A few months ago I set out to build a cognitive substrate without a large language model in the answering path. I had a thesis I liked, a Rust workspace, and a lot of conviction.

Then I wrote a three-line baseline that tied it on every metric I cared about.

This is the story of why that was the best thing that happened to the project — and why I'm still building it, just pointed at a sharper target.

The problem I actually care about

When a language model answers a grounded question, it paraphrases its sources. That paraphrase is fluent, often correct, and — this is the part that bothers me — impossible to bind back to its source byte-for-byte. You can cite a document. You cannot prove, after the fact, that the words in the answer are the words that were stored, unaltered, at a known position.

For a chatbot that doesn't matter. For anything that has to be audited — a compliance trail, a medical or legal memory, an agent acting on your behalf over months — it matters a lot. "Trust me, I read the docs" is not a property you can verify.

So I started building Hyphae: a substrate that answers by emitting byte-identical quotations of stored memory fragments, over a SHA-256 hash-chained journal, with no LLM in the cognition path. Rust, CPU-only, a single binary.

The shape of it is simple:

// Every stored fragment is appended verbatim to a hash chain.
let (seq, head) = journal.append("memory_op", fragment.bytes())?;

// An answer span is a byte-identical quotation of a stored fragment.
// Tamper with any historical entry and the recomputed chain breaks
// at the next link — verify() localises exactly where.
journal.verify()?;

Nothing here is cryptographically novel. Hash-chained logs are old and well understood — Haber & Stornetta in 1991, Merkle before that, Certificate Transparency, git. I want to be honest about that up front, because the interesting part isn't the chain.
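To make the append/verify shape concrete, here is a minimal, std-only sketch of a hash-chained journal. It is not Hyphae's implementation: the hash is Rust's `DefaultHasher` standing in for SHA-256 so the example needs no crates (it is NOT cryptographic), and the `Journal` and `Entry` types are assumptions for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in for SHA-256 so the sketch has no dependencies.
/// NOT cryptographic; a real journal would use a SHA-256 implementation.
fn link_hash(prev: u64, kind: &str, payload: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_u64(prev);
    h.write(kind.as_bytes());
    h.write_u8(0); // separator between kind and payload
    h.write(payload);
    h.finish()
}

/// One journal record: the verbatim payload plus its chained hash.
struct Entry {
    kind: String,
    payload: Vec<u8>,
    hash: u64,
}

#[derive(Default)]
struct Journal {
    entries: Vec<Entry>,
}

impl Journal {
    /// Hash of the latest entry, or 0 for an empty journal.
    fn head(&self) -> u64 {
        self.entries.last().map_or(0, |e| e.hash)
    }

    /// Append the payload verbatim; returns (sequence number, new head).
    fn append(&mut self, kind: &str, payload: &[u8]) -> (usize, u64) {
        let hash = link_hash(self.head(), kind, payload);
        self.entries.push(Entry {
            kind: kind.to_string(),
            payload: payload.to_vec(),
            hash,
        });
        (self.entries.len() - 1, hash)
    }

    /// Recompute the chain from the start. Err(seq) is the first entry
    /// whose stored hash no longer matches its recomputed value.
    fn verify(&self) -> Result<(), usize> {
        let mut prev = 0;
        for (seq, e) in self.entries.iter().enumerate() {
            if link_hash(prev, &e.kind, &e.payload) != e.hash {
                return Err(seq);
            }
            prev = e.hash;
        }
        Ok(())
    }
}
```

In this sketch, editing the payload of entry 1 in place makes `verify()` return `Err(1)`. Exactly which link a real implementation reports depends on how it stores and compares the hashes.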

The day a three-line baseline tied me

I wanted to show Hyphae was better than an LLM+RAG pipeline at grounded answering. So I built the comparison properly: a real retriever, reranking, six models across three retrieval modes, two corpora, twelve metrics.

Then a reviewer asked the obvious question: what does a trivial baseline score?

So I wrote echo — a few lines that just print the retrieved fragment back. It tied Hyphae on every correctness and grounding metric. So did echo + journal.

That stings for about a day. Then it becomes the whole point.

The measured correctness and grounding were never properties of my system. They are properties of verbatim quotation itself. If you emit a stored span unchanged, of course it's "grounded" — it is the source. Hyphae's seventeen subsystems weren't what made the answers auditable. The verbatim-emission-over-a-journal layer was. And that layer is addable to any extractive retrieval system — it isn't Hyphae-specific at all.
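A minimal sketch of why echo scores perfectly on grounding, assuming "grounded" is checked as a byte-identical match against the stored fragment. Both function names are hypothetical, and the actual benchmark scorers may work differently.

```rust
/// The "echo" baseline: the answer is the retrieved fragment, unchanged.
fn echo(retrieved: &str) -> String {
    retrieved.to_string()
}

/// Byte offset at which `answer` occurs verbatim inside `stored`, if any.
/// A paraphrase returns None even when its meaning is correct.
fn verbatim_offset(stored: &[u8], answer: &[u8]) -> Option<usize> {
    if answer.is_empty() || answer.len() > stored.len() {
        return None;
    }
    stored
        .windows(answer.len())
        .position(|window| window == answer)
}
```

With a stored fragment such as "The refresh token expires after 30 days.", the echoed answer binds at offset 0 and the span "expires after 30 days" binds at offset 18, while the paraphrase "expires in a month" has no offset to bind to.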

So I stopped claiming Hyphae was a better brain and started claiming something narrower and, I think, truer:

Verifiable provenance is a property you can add to grounded retrieval. A paraphrase destroys byte-level bindability to its source; a verbatim quotation preserves it, and a hash chain makes that binding independently auditable.

The contribution isn't the hash chain. It's the observation, and the measurement of it against eighteen LLM configurations and a tamper-detection benchmark.

Closing the gaps, honestly

Once you claim "tamper-evident," people who know what they're doing immediately ask where it breaks. Good. The threat model is the product.

A bare hash chain catches a store-only attacker who edits a record in place. It does not catch a chain-aware attacker who recomputes every hash forward and rewrites the head — because the head lives in the same store. So I anchor the head with an Ed25519 signature held outside the store (the attacker can't re-sign). That closes it.
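The two attackers can be told apart in a few lines. This is a sketch under loud assumptions: `DefaultHasher` stands in for SHA-256 (NOT cryptographic), and a plain copy of the head held outside the store stands in for the Ed25519-signed anchor, since signature verification needs a crate. All names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in for SHA-256. NOT cryptographic.
fn link(prev: u64, payload: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_u64(prev);
    h.write(payload);
    h.finish()
}

/// Rebuild every hash from the payloads. This is also exactly what a
/// chain-aware attacker does after editing a record.
fn rechain(payloads: &[Vec<u8>]) -> Vec<u64> {
    let mut prev = 0u64;
    payloads
        .iter()
        .map(|p| {
            prev = link(prev, p);
            prev
        })
        .collect()
}

/// Bare chain check: stored hashes agree with stored payloads.
fn chain_consistent(payloads: &[Vec<u8>], hashes: &[u64]) -> bool {
    rechain(payloads).as_slice() == hashes
}

/// Anchored check: the chain is consistent AND its head equals a value
/// kept outside the store (a stand-in for the signed anchor).
fn anchored_ok(payloads: &[Vec<u8>], hashes: &[u64], pinned_head: u64) -> bool {
    chain_consistent(payloads, hashes) && hashes.last() == Some(&pinned_head)
}
```

An in-place edit fails `chain_consistent` because the old hashes no longer match. After the attacker calls `rechain`, the bare check passes again, and only the comparison against the externally held head in `anchored_ok` still fails.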

But a single signature pins a valid head, not the latest one. Every head the journal ever had was, at its time, legitimately signed. An attacker can roll back to an earlier state and replay its genuine-but-stale anchor — and a lone signature check accepts it. So the heads get published to an append-only, hash-chained ledger, and an auditor checks the current head against the ledger's tail:

// A single signature pins *a* valid head.
// An append-only ledger pins *the latest* one.
verify_fresh_head(&current_head, &ledger, &verifying_key); // rollback rejected

That's the pattern from Certificate Transparency and git, applied to memory provenance: the value isn't the chain, it's publishing the head to a monotonic log that third parties can compare.
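The freshness rule itself is small. The sketch below assumes the ledger is simply the ordered list of published heads and leaves out both the signature check and the ledger's own hash chain, so it is not the signature of the real `verify_fresh_head`; the enum and function names are hypothetical.

```rust
/// Result of comparing a presented head against the published ledger.
#[derive(Debug, PartialEq)]
enum Freshness {
    /// The head is the ledger's tail: the latest published state.
    Fresh,
    /// The head was genuinely published once, but newer heads exist.
    Stale { published_at: usize },
    /// The head never appeared in the ledger.
    Unknown,
}

/// Compare `current` against an append-only list of published heads.
fn check_head(current: u64, ledger: &[u64]) -> Freshness {
    match ledger.iter().rposition(|&h| h == current) {
        Some(i) if i + 1 == ledger.len() => Freshness::Fresh,
        Some(i) => Freshness::Stale { published_at: i },
        None => Freshness::Unknown,
    }
}
```

A rolled-back store presents a head that lands in the `Stale` arm even though its signature would still verify. The check is only as good as the tail the auditor sees: if later ledger entries are withheld, a stale head looks `Fresh`, which is why the tail has to come from somewhere the store does not control.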

And I keep a column for what I haven't closed: a store that withholds later ledger entries is only caught once an auditor gets the true tail from an external witness (a timestamp authority, a gossiped tree head). That's deployment work, and I'd rather write it down than pretend it's solved.

Where this is going

The direction got clearer the moment the echo baseline humbled me. I'm not building a better answer engine. I'm building provenance as a first-class, measurable property of grounded AI, in the open. Concretely:

A provenance benchmark. Correctness benchmarks compare RAG systems on answer quality. There's no standard way to compare verifiable-generation systems on whether tampering is detectable and localisable. So I built one: a tampering taxonomy, an adversary-capability matrix, and a scoring protocol any system can plug into. That's the axis I think actually matters for AI you have to trust over time.
Provenance as an addable layer. The realizer-independence is the feature, not a caveat: the layer does not care which system produces the quotation. The goal is for any extractive retriever to be able to adopt the layer.
External witnessing and key rotation to harden the ledger for real deployments.
The bigger picture — this is one piece of Celiums, where the bet is that memory is the foundation for AI agents you can actually audit, not an afterthought bolted onto a model.

It's all open

The substrate, the LLM+RAG comparator, every result envelope, the tamper-detection experiment, the provenance benchmark, and the full preprint are public. Code is Apache-2.0; the docs, corpora, and preprint are CC-BY-4.0.

Code: https://github.com/terrizoaguimor/hyphae-v2
Preprint (Zenodo DOI): https://doi.org/10.5281/zenodo.20436643

I'm a solo, self-taught founder building this in public, which means the dead ends are public too — the echo baseline being the best example. If you work on retrieval, tamper-evident logs, or grounded generation, I'd genuinely like to hear where you think this breaks. The threat model only gets better when someone smarter than me attacks it.

I Have ADHD and I Keep Losing Context — So I Taught My AI to Remember for Me

Mario Gutierrez — Fri, 17 Apr 2026 13:38:29 +0000

I'm going to be honest about something that most people in tech don't talk about openly.

I have ADHD. I'm 40. I'm a self-taught developer with no CS degree. And I lose context constantly.

Not in the "oh I forgot where I put my keys" way. In the "I just spent 3 hours deep in a codebase, got interrupted by a Slack message, and now I genuinely cannot remember what I was doing or why" way.

If you have ADHD, you know exactly what I'm talking about. If you don't — imagine your browser crashing and losing 47 tabs. That panic? That's every Tuesday for me.

The AI Paradox for ADHD Brains

Here's the thing nobody warned me about: AI coding tools made my ADHD worse.

Not better. Worse.

Why? Because they're stateless. Every session is a blank slate. And for someone who already struggles with working memory, having to rebuild context from scratch every time I open Claude Code or ChatGPT is genuinely exhausting.

Me at 9am: "I'm working on a Next.js app with Supabase auth, 
           JWT refresh tokens, the user table has these columns..."

Me at 2pm: "I'm working on a Next.js app with Supabase auth,
           JWT refresh tokens, the user table has these columns..."

Me at 9pm: "I'm working on a Next.js app with Supabase auth..."

I type the same context paragraph 5-8 times a day. Every. Single. Day.

For a neurotypical developer, this is annoying. For me, each repetition drains the limited executive function I have. By the third time, I'm frustrated. By the fifth, I'm done for the day.

The tool that's supposed to help me think is making me spend my mental energy on remembering what to tell it.

The Breaking Point

It happened on a Thursday at midnight. I was debugging a race condition. I'd been at it for two hours. Claude gave me a suggestion I'd already tried — because it didn't know I'd already tried it. Because it doesn't remember anything.

I typed (and I'm not proud of this):

"I ALREADY TRIED THAT. WE DISCUSSED THIS 20 MINUTES AGO IN THE PREVIOUS SESSION."

Then I realized: I was yelling at a machine for not having memory. And the machine was right to not remember — it was designed that way. Every AI tool is designed that way.

That's when something clicked. Not the bug (that took another hour). Something else:

The problem isn't AI intelligence. It's AI amnesia.

What I Actually Needed

I sat down and wrote a list of what would actually help my ADHD brain:

Don't make me repeat context. Ever. If I told you once, you should know it forever.
Remember what worked AND what didn't. So you don't suggest the thing I already tried and rejected.
Know when I'm frustrated. If I've been debugging the same thing for 2 hours at midnight, maybe don't give me a 500-word explanation. Just give me the fix.
Carry context across projects. My coding style doesn't change when I switch repos. My preferences don't reset.
Forget things naturally. Not everything is worth remembering forever. That random CSS hack from 6 months ago? Let it fade.

I looked for tools that did this. I found vector databases pretending to be memory. I found RAG pipelines that retrieve documents but don't understand context. I found "memory" features that are just conversation logs with a search bar.

Nothing actually modeled how memory works. How my brain works (when it works).

So I Started Building

I'm not going to pitch you anything. But I will tell you what I learned, because maybe it's useful to someone else.

I started reading neuroscience papers. Real ones. Not Medium articles — actual research on how human memory works:

The Ebbinghaus forgetting curve — memories decay exponentially unless reinforced. Your brain doesn't delete things; they just fade. This is the opposite of how databases work (store forever, delete explicitly).
The PAD emotional model — Pleasure, Arousal, Dominance. Psychologists use three dimensions to map any emotional state. Turns out, emotions are not decorations on memory — they're part of the retrieval mechanism. You remember emotional events better. Your brain literally prioritizes them.
Circadian rhythms affect cognition — you're not the same developer at 9am and 11pm. Your ability to focus, recall, and make decisions fluctuates. Any system that treats you the same at all times is ignoring biology.
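The forgetting-curve idea is commonly written as R = e^(-t/S), where t is the time since the memory was last touched and S is its stability. A minimal sketch of decay plus reinforcement follows; the `Memory` struct, the doubling of stability on reinforcement, and the fade threshold are illustrative assumptions, not the tool's actual model.

```rust
/// Retention in (0, 1] after `elapsed_days`, for a given stability.
/// Larger stability means slower fading.
fn retention(elapsed_days: f64, stability_days: f64) -> f64 {
    (-elapsed_days / stability_days).exp()
}

/// A remembered item that fades unless it is reinforced.
struct Memory {
    stability_days: f64,
    last_touched_day: f64,
}

impl Memory {
    /// Current strength of the memory on the given day.
    fn strength_at(&self, day: f64) -> f64 {
        retention(day - self.last_touched_day, self.stability_days)
    }

    /// Reinforcement restarts the clock and slows future decay.
    /// The factor of 2 is an arbitrary choice for this sketch.
    fn reinforce(&mut self, day: f64) {
        self.stability_days *= 2.0;
        self.last_touched_day = day;
    }

    /// Nothing is deleted; an item is merely treated as faded
    /// once its strength drops below the threshold.
    fn faded(&self, day: f64, threshold: f64) -> bool {
        self.strength_at(day) < threshold
    }
}
```

With a stability of 10 days, an untouched memory is at about 37% strength on day 10 and about 14% on day 20. Reinforcing it on day 10 leaves it at about 61% on day 20 instead.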

I started modeling these concepts in code. Not as a product — as a personal tool. Something that would sit between me and my AI tools and just... remember things for me.

The first version was ugly. 200 lines of JavaScript and a JSON file. But it worked. I told it my preferences once, and the next day, they were still there. I told it about a debugging approach that failed, and it remembered that too.

For the first time in years, I felt like my tools were adapting to me instead of the other way around.

What I Learned About Memory (and ADHD)

Building this taught me things about my own brain:

1. Context is not content. There's a difference between "what was said" and "what matters." Most AI memory solutions store conversations. But what I need is context — the decisions, the preferences, the patterns. Not the 400 lines of chat where we discussed them.

2. Forgetting is a feature, not a bug. My ADHD brain forgets things constantly, and yes, it's frustrating. But it also means I don't carry stale assumptions. I approach problems fresh. The best memory system isn't one that remembers everything — it's one that remembers the right things and lets the rest fade.

3. Emotions are metadata. When I'm frustrated with a solution, that frustration is information. It means "this approach has a problem, even if the code technically works." Any memory system that strips emotional context is throwing away signal.

4. My ADHD is a design constraint, not a disability. When I design for my own limitations — short working memory, need for external structure, sensitivity to context switches — I end up building things that work better for everyone. Turns out, neurotypical developers also hate re-explaining context to their AI tools. They're just more patient about it.

The Uncomfortable Truth

Here's what I think about late at night:

Every major AI company is racing to build bigger models, longer context windows, better reasoning. And that's great. But none of them are building memory.

Not real memory. Not the kind that persists, decays, carries emotion, and adapts to you over time.

They're building incredibly smart assistants with permanent amnesia. And we've all just... accepted that?

I didn't accept it. I couldn't. My brain wouldn't let me.

Where I Am Now

I'm a solo founder. Venezuelan, living in Medellín, Colombia. Zero employees. I run the company with AI agents (which is a whole other post).

I've been working on this problem for months. Some days I feel like I'm building something important. Other days I feel like I'm a guy with ADHD who got hyperfocused on a niche problem and can't stop.

Both are probably true.

If you're a developer with ADHD — or honestly, any developer who's tired of being the memory system for your AI tools — I'd love to hear how you deal with it. What workarounds have you found? What do you wish your tools remembered?

And if you're building AI tools: please, for the love of everything, add persistent memory. Your ADHD users will thank you. Your neurotypical users will thank you too, they just won't know why.

This is my first post here. I'm Mario. I'll probably write about building things alone, neurodivergent entrepreneurship, and why AI tools should be designed for how brains actually work — not how we wish they worked.