DEV Community: marcosomma

The Model Does Not Need Memory. The Situation Does.

marcosomma — Mon, 29 Jun 2026 11:50:14 +0000

I think I was asking the wrong question.

For a while, the question was simple: does memory make agents smarter?

It sounds like the right question. It is also a trap, because it assumes that memory should be judged as a generic intelligence booster. You add a memory layer, the agent remembers more things, and somehow the output should become better. More complete. More accurate. More human. More whatever word we are currently using to avoid saying “I hope this expensive thing works.”

After running more experiments, I think that framing is wrong.

Memory does not make the model smarter in any general sense. Most of the time, it cannot. The model already has a massive amount of general procedural and domain knowledge compressed into its weights. If your memory layer recalls information the model already knows, you are not adding intelligence. You are just adding a second path to say the same thing with more latency.

That is the uncomfortable part. A lot of agent memory systems are not failing because recall is broken. They are failing because the recalled information has no marginal value.

They are giving the model something it was already going to do.

The experiment was not a victory lap

I ran a follow-up benchmark on OrKa Brain: 250 tasks, five tracks, brain versus brainless. The point was to see whether procedural memory would show stronger results at scale and across more task types.

It did not.

The absolute rubric score was almost flat. Brain scored 8.39. Brainless scored 8.27. That is a +0.12 difference on a 10-point scale. Technically positive, but not the kind of result you use to announce that persistent memory has unlocked a new era of agent intelligence unless your relationship with evidence is mostly decorative.

The pairwise result looked slightly better at first. Brain won 53.8% of the comparisons. But then the judge showed a 74.4% first-position bias, which means the raw pairwise score could not be read directly. If the judge picks the first answer almost three times out of four, the measurement instrument is not exactly wearing a lab coat. It is flipping a biased coin and writing a confident explanation afterward.

Once I controlled for position, most of the supposed Brain advantage disappeared. Cross-domain transfer did not survive. Anti-pattern avoidance did not survive. Multi-skill composition collapsed into a coin flip. Routing was confounded and needs to be re-run.

Only one track survived: the long same-domain sequence. In that track, the Brain won 74% of the time even when placed in the disfavored position.
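A minimal sketch of that position control, assuming each pairwise judgment is logged with which system was shown first and which one the judge picked. The record layout, the function names, and the toy data are hypothetical; this is not the actual benchmark harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    first: str   # system shown in position one: "brain" or "brainless"
    winner: str  # system the judge preferred


def first_position_rate(judgments):
    """Share of comparisons where the judge picked whichever answer came first."""
    return sum(j.winner == j.first for j in judgments) / len(judgments)


def win_rate_by_position(judgments, system="brain"):
    """Win rate of `system`, split by whether it was shown first or second."""
    rates = {}
    for slot, shown_first in (("first", True), ("second", False)):
        subset = [j for j in judgments if (j.first == system) == shown_first]
        wins = sum(j.winner == system for j in subset)
        rates[slot] = wins / len(subset) if subset else None
    return rates


# Toy data for illustration only, not the benchmark results.
toy = (
    [Judgment("brain", "brain")] * 8
    + [Judgment("brain", "brainless")] * 2
    + [Judgment("brainless", "brainless")] * 7
    + [Judgment("brainless", "brain")] * 3
)

print(first_position_rate(toy))
print(win_rate_by_position(toy))
```

The second-position rate is the conservative number. A system that still wins from the disfavored slot is winning despite the judge's bias, not because of it.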

That is the part worth keeping.

Not because it proves that memory works in general. It does not. It proves something narrower and more useful: memory seems to help when the output depends on previous state in the same evolving situation.

That is a very different claim.

The ceiling was the message

At first glance, you could say the Brain failed. I think that is too easy.

The better reading is that the benchmark exposed the wrong class of memory. Most of what OrKa Brain recalled was procedural knowledge: how to decompose a task, how to reason through trade-offs, how to avoid generic mistakes, how to structure a solution, how to transfer a pattern from one domain to another.

The problem is that capable LLMs already know a lot of this. They have seen endless examples of architecture reviews, debugging sessions, migration plans, support flows, project retrospectives, Stack Overflow arguments, incident reports, code reviews, framework docs, and corporate documents that somehow use five pages to say “we forgot the cache.”

So when the memory layer recalls a generic procedure, the model does not suddenly receive missing information. It receives a reminder of something already present in the weights.

That is why the ceiling effect matters. It is not just an implementation failure. It is a category warning. If memory stores general competence, the model can often route around it because the model already has general competence.

This is why “agent memory” can look impressive in demos and then become strangely weak in benchmarks. In a demo, remembered context feels useful because we can see the system referencing the past. In a benchmark, if the remembered thing does not change the answer, the effect disappears into style, verbosity, or judge preference.

The useful question is not “did the system remember something?”

The useful question is “did the remembered thing contain information the model could not have known or safely inferred?”

That is the line.

Memory is for contingent information

The sharper version of the theory is this:

Memory helps when the answer depends on contingent information absent from the model weights.

Not vertical knowledge. Not domain knowledge. Not “more context” as a generic spell. Contingent information.

By contingent, I mean information whose truth depends on a specific user, system, company, customer, codebase, previous decision, local process, or moment in time. It is information that is not true in general. It is true here.

The model can know how software migrations usually fail. It cannot know that in this codebase, the last migration failed because the billing worker silently depended on a deprecated Redis key.

The model can know what concise writing is. It cannot know that a specific user prefers direct technical answers with no filler, no fake enthusiasm, and no corporate perfume sprayed over the paragraph.

The model can know how customer support triage works. It cannot know that this specific customer always reports billing bugs using the wrong product name.

The model can know how a deployment pipeline usually works. It cannot know that this team avoids Friday releases because one rollback path still depends on a manual script nobody wants to admit exists.

That is memory.

Everything else risks becoming a second copy of generic knowledge attached to a model that already has the first copy.

Codebase AI makes the distinction obvious

Take software engineering, because the mistake becomes very clear there.

A naive approach says: “We need memory so the model knows how to code.” That sounds reasonable until you look at what the model already knows. A capable model has broad exposure to programming languages, design patterns, API conventions, testing strategies, refactoring techniques, infrastructure patterns, and the usual graveyard of best practices everyone quotes and half the industry ignores.

If you use memory to teach it generic software knowledge, you may hit the same ceiling as procedural memory. The model already knows that functions should be small, tests should cover edge cases, migrations should be reversible, and distributed systems enjoy ruining your afternoon. Storing those ideas as memory does not add much. It mostly gives the model a slower way to say what it was already going to say.

The problem is not that the model does not know software engineering. The problem is that it does not know this software system.

That means generic engineering knowledge should not be treated as the valuable memory layer. It is already in the model. The valuable layer is the local shape of the codebase: naming conventions, architectural scars, forbidden dependencies, deployment constraints, hidden coupling, flaky tests, internal abstractions, legacy decisions, and the reason why one ugly function must not be “cleaned up” unless you enjoy incident reports.

The job of repository retrieval is not to remind the model what clean code is. The job is to show it what this codebase actually is.

The model can know how authentication usually works. It cannot know that in this system, the admin role is duplicated across two services because a migration was only half-completed in 2022.

The model can know that Redis keys should have clear ownership. It cannot know that billing still depends on a deprecated cache key because one worker was never moved to the new event pipeline.

The model can know how to write a database migration. It cannot know that this team avoids destructive migrations unless the rollback plan is reviewed by the person who still remembers why the old schema exists.

That is where memory earns its keep.

A style guide tells the model how code should look. Repository history tells it why the code looks wrong but still works. Incident memory tells it where the bodies are buried. Team conventions tell it what changes will be accepted or rejected before CI even has the chance to complain.

Those are not the same layer, and treating them as one generic RAG pile is how you get a system that retrieves a lot and understands very little.

In software products, this distinction matters. Documentation grounds. Repository state shapes. Incident and decision history constrain.

If those three jobs are mixed together, the assistant may still sound like a senior engineer. It may even sound more senior than before, which usually means it has learned to say “trade-off” with confidence. The question is whether it understands the specific system in front of it.

That is where memory stops being decoration and starts being useful.

Chat memory is the same pattern

This is not only a codebase problem. Chat memory shows the same structure in a smaller and more familiar form.

When a system remembers that a user wants concise answers, it is not helping because the model lacked the concept of concision. The model knows how to be concise. The useful information is not “what does concise mean?” The useful information is “what does concise mean for this user?”

That is an entity-bound fact. It attaches a general capability to a specific person.

This is why personalization can be useful without being intellectually deep. The value is not in teaching the model a new writing style from scratch. The value is in selecting the right behavior under a known identity.

The same thing happens inside companies. The model knows how to review code, but it does not know that this repository bans a specific library because it caused a production incident three years ago. The model knows how to process refunds, but it does not know that annual partner contracts have a different refund path. The model knows how to summarize a customer complaint, but it does not know that this customer always describes production incidents as “minor issues” because they are polite to the point of self-sabotage.

That local weirdness is not noise. It is the work.

Real systems are not made only of general rules. They are made of exceptions, scars, habits, conventions, previous mistakes, policy shortcuts, undocumented dependencies, and things everyone knows until the one person who knows leaves the company.

That is the material memory should store.

Not generic competence. Operational sediment.

This changes the architecture

My earlier framing was closer to skill reuse. The agent learns a skill, stores it, recalls it later, and applies it to a new task. It is a clean model. It is also suspiciously diagram-friendly, which should always make us nervous.

The data pushes me toward a different architecture.

A memory system should not start by asking, “What skill is similar to this task?” That can work sometimes, but it is too weak as the central retrieval principle. Similarity is not the same as necessity. A memory can be semantically close and still useless if it does not change the answer.

The better retrieval question is: “What information about this situation would be impossible, unsafe, or expensive for the model to infer?”

That question changes everything.

You retrieve because the task depends on a known entity. You retrieve because the answer varies by product, customer, repository, environment, deployment target, or team convention. You retrieve because the user has stable preferences. You retrieve because the codebase has local rules. You retrieve because the customer has a history. You retrieve because the system has failed before in a way that looks relevant now. You retrieve because the model is about to answer with generic competence where specific state is required.

That is a different trigger from semantic similarity.

The trigger is underdetermination. The model can produce an answer, but the answer is not sufficiently determined by the prompt and general knowledge. It needs local state.
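One way to picture that trigger, as a sketch under a strong assumption: something upstream has already declared which contingent facts the answer depends on. How that list gets produced is the hard part and is left open here. The key names, the `plan_retrieval` helper, and the banned dependency name are all hypothetical.

```python
def plan_retrieval(depends_on, prompt_facts, memory):
    """Fetch only the facts the answer depends on and the prompt does not supply.

    Returns (facts to inject, keys nobody could resolve).
    """
    missing = [key for key in depends_on if key not in prompt_facts]
    fetch = {key: memory[key] for key in missing if key in memory}
    unresolved = [key for key in missing if key not in memory]
    return fetch, unresolved


# Hypothetical local state for one repository.
memory = {
    "repo:release_policy": "No Friday releases: one rollback path is a manual script.",
    "repo:banned_deps": "legacy-cache-client is banned after a production incident.",
}

fetch, unresolved = plan_retrieval(
    depends_on=["repo:release_policy", "repo:banned_deps", "customer:plan"],
    prompt_facts={"repo:banned_deps": "already stated by the user in the prompt"},
    memory=memory,
)

print(fetch)       # only the release policy: the prompt already covers banned deps
print(unresolved)  # the customer plan is needed but stored nowhere
```

Nothing here scores similarity. A fact is retrieved because the answer is underdetermined without it, and an unresolved key is a signal to hedge or ask rather than answer with the average.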

This also means memory should not be one bucket. A serious system probably needs separate stores with separate jobs.

Grounding is for authority. It gives the model exact sources, current docs, repository files, API contracts, schemas, config, and verifiable references.

Operational knowledge is for local shape. It tells the model how this product, repository, customer, workflow, team, or deployment environment behaves.

Episodic memory is for history. It tells the model what happened before with this user, customer, task, session, service, or system.

Reasoning composes the three.
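A small sketch of what "separate stores with separate jobs" could look like in code, assuming each store is a plain mapping from an entity to a list of facts. The `Stores` class, the `compose_context` helper, and the example entries are illustrative assumptions, not an existing API.

```python
from dataclasses import dataclass, field


@dataclass
class Stores:
    grounding: dict = field(default_factory=dict)    # authority: docs, schemas, config
    operational: dict = field(default_factory=dict)  # local shape: conventions, constraints
    episodic: dict = field(default_factory=dict)     # history: incidents, prior decisions


def compose_context(stores, entity):
    """Build a labeled context block so each fact keeps its provenance."""
    sections = []
    for label in ("grounding", "operational", "episodic"):
        facts = getattr(stores, label).get(entity, [])
        if facts:
            lines = "\n".join(f"- {fact}" for fact in facts)
            sections.append(f"[{label}]\n{lines}")
    return "\n\n".join(sections)


stores = Stores(
    grounding={"billing-worker": ["schema: invoices(id, status, cache_key)"]},
    episodic={"billing-worker": ["Last migration failed: worker read a deprecated Redis key."]},
)

context = compose_context(stores, "billing-worker")
print(context)
```

Keeping the labels in the composed context is the point of the design. When an answer goes wrong, you can tell whether authority, local shape, or history was missing, instead of debugging one undifferentiated pile.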

When people collapse all of this into “memory,” they make the system harder to evaluate. They also make it easier to fool themselves, which is apparently the default MLOps workflow whenever agents are involved.

The next benchmark should be crueler

The old question was whether memory improves output quality. That is too broad.

If the model can answer well using general competence, memory will struggle to show a large effect. A judge may prefer one answer over another, but you will end up measuring style, verbosity, formatting, position bias, or the judge’s breakfast.

The next benchmark should make memory necessary.

A task should require a fact that exists only in memory or grounding. Without that fact, the model should either fail, hedge, or produce a generic answer. With that fact, the model should answer correctly.

Not more beautifully. Correctly.

For a chat assistant, the answer might depend on a stored user preference. For a code assistant, it might depend on a repository-specific rule, a past incident, a banned dependency, a migration constraint, or a team convention that exists nowhere in the model weights. For support automation, it might depend on a customer-specific exception, a product-plan quirk, or a previous escalation pattern. For routing, it might depend on a previous failed path inside the same system.

The score should not be “which answer feels more complete?” That is how judge bias walks into the room, sits at the table, and starts grading vibes.

The score should be concrete. Did the system use the right repository rule? Did it respect the deployment constraint? Did it remember the user preference? Did it avoid the known failure path? Did it cite the current internal document? Did it preserve the prior product decision? Did it distinguish generic knowledge from contingent state?

That is where memory should show a real gap. Not a +0.12 polish gap. A correctness gap.
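A minimal sketch of such a scoring rule, assuming each task carries strings that only appear in a correct, memory-informed answer and strings that mark the generic answer. Substring checks are a deliberate simplification; a real harness would need stricter checks. The task contents and library names in the example are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryTask:
    prompt: str
    must_include: tuple     # only derivable from the stored fact
    must_avoid: tuple = ()  # what the generic answer tends to say


def is_correct(task, answer):
    """Binary correctness: the contingent fact was used and the generic trap avoided."""
    text = answer.lower()
    used_fact = all(s.lower() in text for s in task.must_include)
    hit_trap = any(s.lower() in text for s in task.must_avoid)
    return used_fact and not hit_trap


def correctness_gap(tasks, with_memory, without_memory):
    """Accuracy with memory minus accuracy without it, on the same tasks."""
    def accuracy(answers):
        return sum(is_correct(t, a) for t, a in zip(tasks, answers)) / len(tasks)
    return accuracy(with_memory) - accuracy(without_memory)


tasks = [
    MemoryTask("Plan the schema change.", ("rollback review",), ("drop column",)),
    MemoryTask("Pick an HTTP client for this repo.", ("httpx",), ("requests",)),
]
with_memory = [
    "Add the new column and schedule a rollback review first.",
    "Use httpx, the repository bans the other client.",
]
without_memory = [
    "Just drop column old_id in one migration.",
    "Use requests, it is the standard choice.",
]

gap = correctness_gap(tasks, with_memory, without_memory)
print(gap)
```

There is no judge in this loop, so there is no position bias to control for. The answer either used the local fact or it did not.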

If memory cannot win there, then the memory system is probably not doing much. If it does win there, then the value is no longer mystical. It is measurable.

The anti-hype version

I do not think “memory makes agents smarter” is the right claim. It is too vague, and vague claims are where AI hype goes to reproduce.

The sharper claim is less exciting but more useful: memory helps when the task depends on information that is not in the model weights and cannot be safely inferred from general knowledge.

This explains why generic procedural recall barely moved the needle. It explains why long same-domain recall was the only signal that survived. It explains why chat preferences matter. It explains why code assistants need repository state plus decision history, not just more documentation. It explains why enterprise assistants become useful only when they know the local weirdness of the company.

The model already has the average.

Memory is for the deviation.

And most real work is deviation.

That is the part I think we keep missing. We keep building memory systems as if the model is empty and needs to be filled. But the model is not empty. It is full of averages. Full of general patterns. Full of plausible procedures. Full of the kind of answer that is usually right until it meets a real organization, a real customer, a real codebase, or a real human being with preferences that do not fit the median.

Memory is not there to compete with pretraining.

Memory is there to correct the average when the local situation demands it.

The boundary I would use now

I would draw the boundary like this.

Do not store what the model already knows. Do not retrieve what the model can safely infer. Do not call generic advice memory. Do not confuse domain knowledge with contingent state.

Store the things that change the answer because they are specific, current, local, personal, historical, procedural, repository-bound, customer-bound, product-bound, or system-bound.

Use grounding for authority. Use memory for contingency. Use reasoning for composition.

Mixing those three together is how you build a very expensive autocomplete with a scrapbook attached.

The benchmark did not prove that memory is useless. It proved that memory has to earn its place. If the remembered thing does not change the answer, it is not memory in any meaningful operational sense. It is just noise with a timestamp.

The model does not need memory.

The situation does.

Maybe It Is Not Yet Time To Bring Every AI Demo To Production

marcosomma — Tue, 23 Jun 2026 14:26:29 +0000

There is a sentence I keep hearing in AI engineering that sounds innocent, practical, and mature: “Just add a fallback provider.”

Clean. Elegant. Wonderful. The kind of sentence that usually survives only until production starts touching it. Because in a demo, fallback means: Provider A fails, call Provider B. In production, fallback means something very different.

Will Provider B interpret the prompt in the same way? Will it serialize the tool schema in the same way? Will it respect cache directives in the same way? Will it stream tokens in the same way? Will it expose errors in the same way? Will it count tokens in the same way? Will it respect timeouts in the same way? Will it fail in a way your system can actually understand?

Most of the time, the answer is no.

And this is where the current AI industry keeps doing its favorite magic trick. It takes something deeply unstable, wraps it in a familiar API shape, gives it a shiny compatibility label, and suddenly everyone behaves as if we have a standard. We do not have a standard. We have a costume!

Most of the famous “OpenAI compatible” APIs are lying, hiding the lack of standards behind a known name. In reality they are compatible only on the shallowest path. You can send a basic chat request and get text back. Fantastic. The demo works. The slide looks good. The architecture diagram has fewer boxes. But the moment you move beyond “hello model, summarize this paragraph”, things start to fracture.

Tool calling. Structured output. JSON enforcement. Prompt caching. Streaming. Retry behavior. Usage accounting. Model aliases. Safety overlays. Regional routing. Timeout semantics. Error objects. Response envelopes. Context handling. Provider-specific parameters. All the boring parts. In other words, all the parts that decide whether your AI system survives production.

As we know, the demo works because the demo is NOT the system

The demo is usually a happy path. One user. One model. One provider. One prompt. One task. Maybe no cache. Maybe no concurrency. Maybe no structured output. Maybe no audit trail. Maybe no fallback. Maybe no customer-specific version pinning. Maybe no compliance requirement. Maybe no cost pressure. Maybe no incident where several parallel streams connect and then produce absolutely nothing for two minutes.

In that world, AI feels magical. In production, AI feels like distributed systems decided to have a child with legal ambiguity and probabilistic behavior.

You are not only integrating a model. You are integrating a runtime. And that runtime is usually not specified clearly enough. This is the part people keep missing.

The model is not the full product. The provider’s serving stack is part of the product. The SDK is part of the product. The serialization layer is part of the product. The cache implementation is part of the product. The safety wrapper is part of the product. The regional routing strategy is part of the product.

So when someone says, “it is the same model”, I increasingly hear: “we did not measure the parts around the model.”

Same weights do not mean same behavior. Same model family does not mean same production contract. Same endpoint shape does not mean same system.

Same model, different reality

One of the strongest examples I have seen came from a direct comparison between Provider A and Provider B using the same model family on real production-like workflows. The headline looked simple: same model family, different provider path. The result was not simple.

On Workflow 1, the quality regression was statistically significant. Provider A had a mean score of 0.716. Provider B had a mean score of 0.497. The p-value was below 0.0001, with a medium effect size. That is not “a bit of noise.” That is the kind of difference that should stop a migration.

The interesting part is that not every workflow regressed. On Workflow 2 and Workflow 3, the result was basically fine.

Good. That makes the result more credible, not less. Because real provider migrations do not fail everywhere. They fail in specific workflows, specific prompts, specific schema paths, specific flows, specific edge cases. The average can look acceptable while one critical workflow quietly gets worse.

This is exactly why “we tested a few prompts manually and it looked okay” is not engineering. It is theater with curl commands.

If you want to switch provider, you need replay traces. You need evals. You need per-workflow scores. You need statistical comparison. You need to know where the behavior changed, not just whether the model still speaks fluent corporate English.
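A sketch of a per-workflow comparison using only the standard library. The text above does not say which statistical test was used, so this swaps in a seeded permutation test plus Cohen's d as one reasonable option. Workflow names, scores, and the `alpha` threshold are made up for illustration.

```python
import random
from statistics import mean, stdev


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * stdev(a) ** 2 + (len(b) - 1) * stdev(b) ** 2) / (
        len(a) + len(b) - 2
    )
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def permutation_p(a, b, rounds=2000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        if abs(mean(pooled[: len(a)]) - mean(pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)


def compare_workflows(scores_a, scores_b, alpha=0.01):
    """One verdict per workflow, so a bad workflow cannot hide in the average."""
    report = {}
    for workflow, a in scores_a.items():
        b = scores_b[workflow]
        p = permutation_p(a, b)
        report[workflow] = {
            "delta": mean(b) - mean(a),
            "d": cohens_d(a, b),
            "p": p,
            "block_migration": p < alpha and mean(b) < mean(a),
        }
    return report


# Invented replay scores for two hypothetical workflows.
provider_a = {
    "checkout": [0.80, 0.70, 0.75, 0.72, 0.78, 0.69, 0.74, 0.71],
    "search": [0.60, 0.62, 0.58, 0.61, 0.59, 0.60, 0.63, 0.57],
}
provider_b = {
    "checkout": [0.50, 0.45, 0.55, 0.52, 0.48, 0.50, 0.47, 0.53],
    "search": [0.61, 0.60, 0.59, 0.62, 0.58, 0.60, 0.57, 0.63],
}

report = compare_workflows(provider_a, provider_b)
for workflow, row in report.items():
    print(workflow, row["block_migration"], round(row["delta"], 3))
```

The gate is per workflow on purpose. Averaging the two workflows above would blur exactly the regression you need to see.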

Cost is not only list price

The same comparison showed around a 2x cost premium on a high-volume workflow. At first glance, you might blame provider pricing. But the back-calculation pointed somewhere more boring and more dangerous: prompt caching.

On Provider A, implied token volume was 60 to 67 percent below reported tokens. That is the cache signature. You are still sending the structure, but you are not paying the full input cost every time because the provider is reusing cached prompt blocks.

On Provider B, one high-volume path showed exactly 0 percent gap. Cache was either off or always missing. Other paths showed partial cache behavior, around 14 to 21 percent in one case and around 33 percent in another.
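The back-calculation can be sketched as simple arithmetic, assuming a flat list price per input token and ignoring cache-write premiums. The price, token counts, and function names below are hypothetical placeholders, not the real figures from the comparison.

```python
def implied_input_tokens(billed_input_cost, list_price_per_million):
    """Tokens you would have paid for at full list price, given the bill."""
    return billed_input_cost / list_price_per_million * 1_000_000


def cache_gap(billed_input_cost, reported_input_tokens, list_price_per_million):
    """Fraction of reported input tokens that was not billed at full price.

    A gap near zero suggests the cache was off or always missing.
    """
    implied = implied_input_tokens(billed_input_cost, list_price_per_million)
    return 1 - implied / reported_input_tokens


# Hypothetical numbers: 10M reported input tokens at an assumed $3 per million.
cached_path = cache_gap(
    billed_input_cost=12.0, reported_input_tokens=10_000_000, list_price_per_million=3.0
)
uncached_path = cache_gap(
    billed_input_cost=30.0, reported_input_tokens=10_000_000, list_price_per_million=3.0
)

print(round(cached_path, 2), round(uncached_path, 2))
```

This is a signature, not an exact hit rate: cached reads are usually discounted rather than free, so the gap understates how many tokens were actually served from cache.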

Same model family. Different cache reality. Different bill. This is where the “just switch provider” crowd usually becomes very quiet.

Because caching is not decoration. In high-volume AI systems, caching is part of the economic architecture. If cache semantics change, your unit economics change. If regional routing causes cache misses, your cost model changes. If one provider respects cache directives differently from another, your production bill changes while every individual request still “works.”

That is the worst kind of failure. The successful one. No exception. No stack trace. No screaming service. Just a quiet invoice telling you the abstraction was fake.

Cross-region cache is a beautiful little trap

Cross-region inference sounds robust. More regions. More availability. More resilience. Then you look at the cache behavior.

A request served in Region A writes a cache in Region A. The next request may route to Region B. Region B does not have that cache. So it misses and writes again. Then another call may route back to Region A, or somewhere else, depending on capacity and routing.

This is not a clean “double pay” situation. It is worse conceptually. You keep paying the cache write premium without reliably amortizing it through cheap cache reads. That is how you can end up with a measured 0 percent hit rate while thinking you configured caching correctly.
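A toy simulation of that mechanism, under loud assumptions: each region keeps its own cache, entries expire after a fixed TTL that is refreshed on every touch, and requests are routed round-robin. Real routing depends on capacity and is not this regular, and the TTL and interval values are invented.

```python
def simulate_hit_rate(num_requests, regions, interval_s, ttl_s):
    """Return the cache hit rate for identical prompts sent at a fixed interval."""
    last_touch = {}  # region -> time the prompt prefix was last written or read there
    hits = 0
    for i in range(num_requests):
        now = i * interval_s
        region = regions[i % len(regions)]  # assumption: round-robin routing
        touched_at = last_touch.get(region)
        if touched_at is not None and now - touched_at < ttl_s:
            hits += 1  # cheap cache read
        # On a miss the request pays the cache write premium again.
        last_touch[region] = now
    return hits / num_requests


single_region = simulate_hit_rate(100, ["region-a"], interval_s=120, ttl_s=300)
three_regions = simulate_hit_rate(
    100, ["region-a", "region-b", "region-c"], interval_s=120, ttl_s=300
)

print(single_region, three_regions)
```

With one region, every request after the first lands inside the TTL. With three regions, each region sees the prompt only every 360 seconds, so every entry has expired by the time traffic comes back, and every request pays for a write that nothing ever reads.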

Again, from the outside everything looks compatible. The API accepts your request. The model responds. The integration works. Except the economics are different because the serving layer changed.

This is why AI production work is becoming less about prompts and more about contracts. What exactly is guaranteed? What is pinned? What is regional? What is cached? What is counted? What is replayable? What is stable?

If the answer is “trust us, it is compatible”, my engineering translation is simple: no contract found.

Reliability is not portable either

Another production-style incident: under concurrency, multiple parallel streams connected and then produced nothing for roughly two minutes. No tokens. No useful error. Just waiting.

The likely reading was capacity or throttle queueing. The provider may have been holding the request instead of returning a clean throttling response. Depending on the endpoint, one path may queue in-flight work while another may throw a clear rate-limit error.

That distinction matters. A clear rate-limit error is ugly but useful. You can react to it. You can retry with backoff. You can trigger fallback. You can protect the system. A connected stream producing nothing for two minutes is a different species of failure. Your system is alive enough to wait and dead enough to be useless.
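One defensive pattern is to turn the silent hang into an explicit error with a time-to-first-token deadline. This is a sketch using `asyncio.wait_for` over a generic async iterator. The `SilentStream` exception, the helper name, and the fake streams are hypothetical, and a real provider SDK stream may need its own adapter.

```python
import asyncio


class SilentStream(Exception):
    """The stream connected but produced no token within the deadline."""


async def read_with_first_token_deadline(stream, first_token_timeout_s):
    """Collect tokens, but fail fast if the first one never arrives."""
    iterator = stream.__aiter__()
    try:
        first = await asyncio.wait_for(iterator.__anext__(), first_token_timeout_s)
    except asyncio.TimeoutError:
        raise SilentStream(f"no token after {first_token_timeout_s}s") from None
    except StopAsyncIteration:
        return []  # stream closed cleanly without producing anything
    tokens = [first]
    async for token in iterator:
        tokens.append(token)
    return tokens


async def hanging_stream():
    await asyncio.sleep(10)  # connected, then nothing
    yield "late"


async def healthy_stream():
    for token in ("ok", "!"):
        yield token


async def demo():
    healthy = await read_with_first_token_deadline(healthy_stream(), 0.05)
    try:
        await read_with_first_token_deadline(hanging_stream(), 0.05)
        outcome = "tokens"
    except SilentStream:
        outcome = "silent_stream"
    return healthy, outcome


result = asyncio.run(demo())
print(result)
```

Once the hang has a name, it can be handled like the rate-limit case: retried with backoff, counted, or routed to fallback, instead of holding a connection open for two minutes.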

There was also a competing hypothesis: maybe the network layer was involved. Gateway behavior, private endpoints, load balancers, idle timeouts, streaming connection drops, or capacity errors could all produce overlapping symptoms.

So the correct response was not “the provider is bad.” The correct response was: inspect runtime metrics during the hang windows. Check throttle counters. Check server error counters. Check network timeouts. Check connection lifetime. Check whether the request reached the model runtime at all.

This is what production AI looks like. Not prompt magic. Not demo videos. Not “look, I built an agent in 20 minutes.” It looks like debugging whether a zero-token two-minute hang is caused by model capacity, runtime queueing, network infrastructure, streaming semantics, retry policy, or your own concurrency design.

Very glamorous. Someone should put that in the launch video.

Structured output is not standard output

Then there is the SDK serialization problem. Same model. Same app-level input. Different token count.

One comparison showed Provider B using around 10,473 input tokens while Provider A used around 10,019. That is a 454-token delta, roughly 4.5 percent.

The clue was structured output. On one provider path, structured output was implemented by injecting a tool schema into the prompt. On the other path, it was handled differently. Even after making payloads byte-identical at the application level, the remaining structural difference came from provider-specific cache directive serialization.

This is a perfect example of why API compatibility is not enough. Your prompt may be identical. Your provider prompt is not.
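The arithmetic behind that delta is small enough to keep as a standing check. A sketch, with a hypothetical helper name and an arbitrary 2 percent alert threshold that you would tune per workflow:

```python
def token_drift(baseline_tokens, candidate_tokens, alert_pct=2.0):
    """Compare input token counts for the same app-level payload on two paths."""
    delta = candidate_tokens - baseline_tokens
    pct = delta / baseline_tokens * 100
    return {"delta": delta, "pct": pct, "alert": abs(pct) > alert_pct}


# Figures from the comparison above: Provider A as baseline, Provider B as candidate.
drift = token_drift(baseline_tokens=10_019, candidate_tokens=10_473)
print(drift["delta"], round(drift["pct"], 1), drift["alert"])
```

A nonzero delta on a byte-identical payload means something between your application and the model added or reshaped tokens, which is worth knowing before comparing eval scores.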

The actual thing seen by the model may include hidden scaffolding, injected schemas, translated parameters, safety wrappers, tool definitions, response constraints, or provider-specific envelopes. Then we compare eval scores and pretend we tested the same thing.

Did we? Maybe. Maybe not. And if we cannot answer that confidently, then we are not measuring model quality. We are measuring a mix of model behavior, SDK translation, provider scaffolding, and our own assumptions.

Very scientific. Very enterprise. Very “move fast and accidentally compare different systems.”

Fallback can become the outage

Cross-provider fallback sounds responsible. It can be responsible. But it is not free.

One concrete incident involved a preview model on Provider C. The model had intermittent hangs, produced retry-exhaustion errors after repeated timeouts, and reported zero input tokens and zero output tokens. So the model did not even really start.

The retry budget burned for several minutes. Then a failure-rate guard aborted the whole job. The fix was to add a fallback model. Good fix. But the lesson is bigger.

The fallback path needs its own engineering. It needs its own timeout budget. It needs its own cost assumption. It needs its own quality expectation. It needs its own reason to exist.

A useful rule: scale up on fallback by default. If fallback runs rarely, a usable answer matters more than saving a few cents. Scale down only when the primary failed because the request exceeded model limits.

But if your fallback inherits the same exhausted timeout budget from the primary, congratulations, you did not build fallback. You built a decorative second failure.
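A sketch of what separate budgets could look like, assuming the provider call is a plain function that takes a model name and a per-attempt timeout and raises the built-in `TimeoutError` when it gives up. The `PathBudget` fields, the model names, and the numbers are placeholders.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class PathBudget:
    model: str
    timeout_s: float   # per-attempt timeout handed to the provider call
    max_attempts: int
    total_s: float     # wall-clock budget owned by this path alone


def run_path(call, budget, clock=time.monotonic):
    """Retry one path until it answers or its own budget is spent."""
    deadline = clock() + budget.total_s  # the clock starts when this path starts
    for _ in range(budget.max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break
        try:
            return call(budget.model, min(budget.timeout_s, remaining))
        except TimeoutError:
            continue
    raise TimeoutError(f"{budget.model}: budget exhausted")


def call_with_fallback(call, primary, fallback):
    try:
        return run_path(call, primary)
    except TimeoutError:
        # Fresh deadline for the fallback, not the leftovers of the primary.
        return run_path(call, fallback)


calls = []


def fake_call(model, timeout_s):
    """Stand-in for a provider call: the primary always times out."""
    calls.append(model)
    if model == "primary-preview":
        raise TimeoutError
    return model, timeout_s


primary = PathBudget("primary-preview", timeout_s=5.0, max_attempts=3, total_s=10.0)
fallback = PathBudget("fallback-large", timeout_s=20.0, max_attempts=2, total_s=30.0)

answer = call_with_fallback(fake_call, primary, fallback)
print(answer, calls)
```

The fallback here also gets a longer per-attempt timeout than the primary, in the spirit of scaling up when the fallback runs rarely.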

Fallback is not a backup model. Fallback is a second production path.

Even parameter names are not portable

Small example, but very revealing: a “compatible” API for a model behaved differently around a reasoning-related parameter. The workaround was to force a safe default.

That is reasonable. But the real portability risk is not only the value. It is the parameter contract itself.

Another provider may call it something else. Another may ignore it. Another may reject it. Another may apply a different default. Another may support it only on some models. Another may support it in preview and remove it later with very little warning.
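A sketch of making that parameter contract explicit per provider instead of passing parameters through and hoping. The provider names, the `reasoning_level` parameter, and the rule table are placeholders invented for illustration; the real parameter involved in the incident is not named here.

```python
# Hypothetical per-provider contract for one canonical parameter.
PARAM_TABLE = {
    "provider_a": {"reasoning_level": {"action": "rename", "to": "reasoning_level"}},
    "provider_b": {"reasoning_level": {"action": "force", "to": "thinking", "value": "off"}},
    "provider_c": {"reasoning_level": {"action": "drop"}},
}


def adapt_params(provider, params, table=PARAM_TABLE):
    """Translate canonical parameters into a provider payload, recording every change."""
    payload, notes = {}, []
    rules = table[provider]
    for name, value in params.items():
        rule = rules.get(name)
        if rule is None:
            # No declared contract: fail loudly instead of guessing.
            raise ValueError(f"{provider}: no declared contract for '{name}'")
        if rule["action"] == "rename":
            payload[rule["to"]] = value
        elif rule["action"] == "force":
            payload[rule["to"]] = rule["value"]
            notes.append(f"{name}: forced to {rule['value']!r}, requested {value!r}")
        elif rule["action"] == "drop":
            notes.append(f"{name}: not supported, dropped")
    return payload, notes


for provider in PARAM_TABLE:
    print(provider, adapt_params(provider, {"reasoning_level": "high"}))
```

The notes matter as much as the payload. A forced default or a dropped parameter that leaves no trace is how two runs end up being compared as if they were the same request.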

This is where “compatible API” starts to feel like saying every car is steering-wheel-compatible. Technically true. Please do not use that as your safety case.

Preview is not production

A lot of AI teams are building production workflows on preview models, preview parameters, preview endpoints, preview SDK behavior, and preview pricing assumptions. Then they act surprised when preview behaves like preview.

Preview can mean weaker guarantees. It can mean limited support. It can mean behavior changes. It can mean short deprecation windows. It can mean different rate limits. It can mean hidden routing changes. It can mean features that work today and become “not recommended” tomorrow.

That is fine for exploration. That is not fine when your production system depends on it and nobody wrote down the risk.

Again, the issue is not that preview exists. Preview is useful. The issue is pretending preview is stable because the demo worked.

We need stable interfaces, not demo optimism

I am increasingly convinced that production AI needs something closer to long-term-support thinking.

Not because models should stop improving. They will improve. The field moves fast. Fine. But production systems cannot keep pretending that every model upgrade, provider switch, SDK change, cache behavior update, or model alias movement is harmless.

When a system is performing fine, switching the model or serving path can create more issues than benefits.

The defensible version is conditional: long-term support becomes inevitable when capability growth slows enough that stability outweighs the next incremental benchmark gain. At that point, many companies will not want the newest model. They will want the model-runtime contract that keeps working.

But the deeper point is that the thing needing long-term support is not only the model weights. It is the interface.

The stable surface must include serving stack, quantization, SDK behavior, tool serialization, cache semantics, timeout behavior, safety overlays, error formats, and versioned model aliases.
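One way to make that surface concrete is to pin it as data and diff it. A sketch, where the field names follow the list above and every value is a made-up placeholder; how you would actually observe each field at runtime is an open question this does not answer.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RuntimeContract:
    model_alias: str
    serving_stack: str
    quantization: str
    sdk_version: str
    tool_serialization: str
    cache_semantics: str
    timeout_behavior: str
    safety_overlay: str
    error_format: str


def contract_drift(pinned, observed):
    """Return every field whose observed value differs from the pinned one."""
    old, new = asdict(pinned), asdict(observed)
    return {name: (old[name], new[name]) for name in old if old[name] != new[name]}


# Placeholder values for illustration only.
pinned = RuntimeContract(
    model_alias="model-x-2026-01",
    serving_stack="provider-a/us",
    quantization="bf16",
    sdk_version="1.4.0",
    tool_serialization="native-tools",
    cache_semantics="explicit-directive",
    timeout_behavior="error-on-throttle",
    safety_overlay="default",
    error_format="v1",
)
observed = replace(pinned, sdk_version="1.5.0", cache_semantics="automatic-prefix")

drift = contract_drift(pinned, observed)
print(drift)
```

An empty diff does not prove behavior is unchanged, but a non-empty one is a cheap reason to rerun the eval gate before anyone calls the change harmless.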

Maybe the real answer is not long-term-support models. Maybe it is long-term-support interfaces with deterministic check layers.

Swappable models behind a stable contract. Replayable traces. Eval gates. Schema normalization. Provider-specific adapters. Explicit cache tests. Timeout isolation. Failure-mode classification. Version-pinned prompts. Known fallback policy.

That sounds boring. Good. Production should be boring.
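To make the "stable contract, replayable traces, eval gates" idea concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the `RuntimeContract` fields, the trace format, the adapter stubs, and the alias string are hypothetical and do not correspond to any real provider SDK. It only shows the shape: replay recorded samples through a candidate adapter and list what broke before anything is switched.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# A provider adapter is any callable that takes a prompt and returns a dict.
Adapter = Callable[[str], Dict[str, Any]]


@dataclass(frozen=True)
class RuntimeContract:
    """Hypothetical record of what the application depends on."""
    model_alias: str                 # pinned, versioned alias (illustrative value)
    prompt_version: str              # version-pinned prompt
    required_keys: Tuple[str, ...]   # keys the normalized output must contain


def eval_gate(contract: RuntimeContract, adapter: Adapter,
              traces: List[Dict[str, Any]]) -> List[str]:
    """Replay recorded traces through an adapter and list contract violations.

    This gate only checks output shape and drift against the recorded answer.
    """
    violations = []
    for trace in traces:
        output = adapter(trace["prompt"])
        missing = [key for key in contract.required_keys if key not in output]
        if missing:
            violations.append(f"{trace['id']}: missing keys {missing}")
        elif output != trace["expected"]:
            violations.append(f"{trace['id']}: output drifted from recorded trace")
    return violations


# One recorded production sample (made up for this sketch).
traces = [{"id": "t1", "prompt": "route: login fails", "expected": {"queue": "auth"}}]


def current_adapter(prompt: str) -> Dict[str, Any]:
    return {"queue": "auth"}


def candidate_adapter(prompt: str) -> Dict[str, Any]:
    return {"route": "auth"}  # renamed key: a small serialization drift


contract = RuntimeContract("example-model-2026-01", "v3", ("queue",))
print(eval_gate(contract, current_adapter, traces))    # []
print(eval_gate(contract, candidate_adapter, traces))  # ["t1: missing keys ['queue']"]
```

The candidate adapter returns a perfectly reasonable answer under a different key. Without the gate, that difference would surface in production instead of in a replay.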

The problem is not that AI is useless

This is usually where someone misunderstands the argument. The point is not that AI is useless. The point is not that demos are bad. The point is not that teams should stop experimenting.

The point is that demos and production systems are different organisms. A demo proves possibility. Production requires repeatability. A demo proves that the model can answer. Production requires knowing what happens when it does not answer, answers differently, answers slowly, answers with a hidden schema injection, misses cache across regions, changes token accounting, streams forever, returns a provider-specific error, or silently regresses one workflow while improving another.

AI interfaces today are still too fragmented for the amount of confidence people are placing in them. We are building production systems on unstable runtime surfaces and pretending the abstraction is mature because the JSON shape looks familiar.

That is not engineering maturity. That is hope with headers.

So maybe not every demo belongs in prod yet

Maybe it is not yet time to bring every AI demo to production. Or more precisely: maybe it is not time to bring demos to production without first building the missing runtime layer around them.

Not another wrapper. Not another “universal SDK” that hides provider differences until they explode. A real layer. One that treats each provider as a different runtime with different semantics.

One that records traces. Replays production samples. Compares quality. Measures cost after caching. Tracks token deltas. Normalizes errors. Separates timeout budgets. Tests fallback paths. Pins model versions. Detects serialization drift. Audits structured output behavior. Makes provider migration observable before it becomes an outage.

Because changing provider is not changing a base URL. It is migrating the runtime contract of your AI system.

And if your system does not know what that contract is, then the provider switch is not a migration. It is an experiment in production.

Very innovative, yes. Also known in some older engineering traditions as a bad idea.

Am I Becoming Too Slow for the AI World?

marcosomma — Wed, 03 Jun 2026 17:30:38 +0000

The AI world is full of old infrastructure with stochastic organs.

That sentence probably explains better than anything why I feel slow lately. Not because I stopped caring about AI. Not because I cannot build anymore. Not because the tools moved beyond me. If anything, the tools moved in the opposite direction: they make me faster at generating code, faster at prototyping, faster at touching layers I would normally approach one by one.

And still, inside the work, I feel slower.

This is the uncomfortable part. AI gives me the sensation that everything should move faster, but the more I use it seriously, the more I end up spending time in the parts that do not accelerate cleanly. The code appears quickly. The draft appears quickly. The workflow appears quickly. But then I need to understand what it actually does. I need to test the boundary between components. I need to verify that the result is not just plausible, but correct enough to survive contact with reality.

Maybe this is the first real trap of AI development:

Creation became cheap. Verification did not.

That sounds simple, almost too obvious, but I think it is the reason many of us feel this strange mismatch between speed and exhaustion. We can now produce more surface area than we can comfortably inspect. A feature that would have taken days to sketch can appear in hours. A backend route, a frontend component, a prompt chain, a test, a deployment script, a workflow diagram: all of it can be generated quickly enough to create the illusion that the whole process compressed.

But the whole process did not compress.

The expensive part moved.

You still have to understand the system. You still have to validate the assumptions. You still have to test the end-to-end behavior. You still have to ask whether the thing is stable, maintainable, safe, observable, and aligned with the original intention. AI made the first draft cheaper, but it also made it easier to produce first drafts across more layers at once. So the final burden often becomes larger, not smaller.

Before AI, there was a natural friction in development. You wrote more slowly, so production and understanding were closer together. Now production can sprint ahead of understanding. You can build faster than you can digest. And once that happens, the bottleneck becomes obvious.

It is not typing.

It is not even coding.

It is judgment.

This is where my feeling of slowness begins. I am not slow in the part that AI is good at accelerating. I am slow in the part that matters after acceleration. I am slow when I need to decide whether a generated thing deserves to exist inside a system. I am slow when I need to move from “this works once” to “this is trustworthy enough to become infrastructure.”

That kind of slowness does not look good in a noisy field.

The AI world rewards motion. It rewards people who react quickly, rename quickly, package quickly, comment quickly, and post before the concept has cooled down. Every week there is a new model, a new benchmark, a new agent framework, a new “this changes everything” moment, a new tool that is apparently going to replace half the industry and then disappear into a GitHub archive two weeks later.

At some point, I became tired of chasing it.

Not completely. I still follow the field. I still care about what matters. But I no longer have the energy to treat every release, every demo, every thread, and every AI product with a gradient background as if it deserves my full attention. A proof of concept is not a product. A prompt chain is not cognition. A wrapper is not infrastructure. A dashboard is not an operating system for intelligence just because someone wrote “agentic” in the hero section.

After a while, the noise becomes expensive.

And when I stopped chasing every update, something strange happened. I did not become less interested in AI. I became more interested in older things.

Distributed systems. Permissions. Control loops. Network optimization. Separation of concerns. Routing. Handshakes. Feedback. Biological systems. Ant colonies. Viruses. Evolution. Complex adaptive systems.

The more I look at AI, the more I see old problems returning in a new substrate. Not copied perfectly. Not solved automatically. But returning. The same families of problems keep appearing under newer language. How do parts coordinate? Who is allowed to access what? Where does a decision happen? What happens when a component fails silently? How do you stop local uncertainty from becoming global corruption? How do you keep a system observable when some of its organs speak in probabilities?

That is why I keep saying that AI infrastructure often feels like old infrastructure with stochastic organs. The body is familiar. The organs behave differently.

And this is where I need to be more precise, because vague depth is too easy. Saying “old principles return” is not enough. It risks becoming another elegant sentence that avoids doing the actual work.

Take orchestration.

A lot of what people now call AI orchestration is not conceptually new. We already had coordination, routing, permissions, queues, retries, fallbacks, handshakes, separation of concerns, and validation boundaries. None of those appeared because LLMs arrived. They were already part of software, distributed systems, automation, and infrastructure engineering.

But the component changed.

A deterministic service fails in ways we can often reason about. A request times out. A schema breaks. A permission check rejects access. A queue stalls. A dependency returns an error. The failure may be painful, but at least it often announces itself.

A generative component can fail while sounding successful.

It can return a clean boolean and still be wrong. It can pass a check while carrying uncertainty underneath the return type. It can produce a fluent answer that looks like completion but is actually drift. It can say “yes” with confidence because the prompt, the model, and the input distribution all lined up toward the same blind spot.

That is the part that changes the orchestration problem.

The old principle still matters: boundaries are useful. Checks are useful. Separation of concerns is useful. Permissions are useful. But the error model under the boundary is different. A handshake with a stochastic component is not the same as a handshake with a deterministic one. The interface may look clean, but the uncertainty has not disappeared. It has been compressed.

This is why I do not think the right adaptation is simply “use AI, then check AI.” That is too soft. It sounds responsible, but it hides the important question: what kind of check, under what error model, with what independence assumptions?

Imagine a pipeline like this:

A > B > C > final check

If the final check passes, the system continues. If it fails, the system routes somewhere else. This is simple. It is also risky, because an early hallucination can travel through the whole chain before being inspected.

So we make the process more granular:

A > check > B > check > C > check > evaluate checks

That feels better, and in many cases it is better. Granularity gives the system a higher sampling rate. It interrupts compounding. It catches problems earlier. It lowers the impact of one local failure because the system becomes observable and interruptible at more points.

But it does not magically lower the uncertainty of the model.

It lowers propagated uncertainty. It lowers blast radius. It lowers the chance that one bad step contaminates everything downstream. Those are real gains.
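A minimal sketch of the granular version, assuming each stage is a plain function and each check is a boolean predicate. The stage names and toy lambdas are hypothetical stand-ins for model calls. The point is only that a rejected step stops the chain and the last accepted value is what survives.

```python
from typing import Callable, List, Tuple

Step = Callable[[str], str]
Check = Callable[[str], bool]


def run_with_checks(value: str,
                    stages: List[Tuple[str, Step, Check]]) -> Tuple[str, List[str]]:
    """Run each step, check its output immediately, and stop at the first failure.

    Returns the last accepted value and a log of what happened.
    """
    log = []
    for name, step, check in stages:
        candidate = step(value)
        if not check(candidate):
            log.append(f"{name}: rejected")
            return value, log  # the bad output never reaches later stages
        log.append(f"{name}: ok")
        value = candidate
    return value, log


# Toy stages: B "hallucinates" a detail and its check catches it.
stages = [
    ("A", lambda s: s.strip(), lambda s: len(s) > 0),
    ("B", lambda s: s + " [invented detail]", lambda s: "invented" not in s),
    ("C", lambda s: s.upper(), lambda s: s.isupper()),
]

result, log = run_with_checks("  refund request  ", stages)
print(result)  # refund request
print(log)     # ['A: ok', 'B: rejected']
```

Stage C never runs on the contaminated text. That is the blast-radius gain. Nothing here makes stage B itself more reliable.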

But the atomic uncertainty remains.

And here is the part I think matters more than the usual AI safety slogan: more checks do not help much if all the checks share the same blind spot.

Classical retry logic quietly assumes some degree of independence. If a service call fails because of a transient network problem, trying again may work. If a worker crashes because of temporary load, retrying elsewhere may work. The same idea often gets imported into AI workflows without being inspected: ask again, check again, validate again, add another gate.

But generative errors are often correlated.

The same model, with the same prompt family, reading the same kind of input, can reproduce the same wrong conclusion several times. A pipeline can collect many green checkmarks that all share the same flaw. At that point, granularity does not create safety. It creates high-resolution false confidence.

That is the difference between variance and bias.

Granular checks help with variance. They catch random, local, one-off deviations. They make the system less fragile against isolated mistakes.

They do not fix bias. If the checker is systematically wrong, adding more instances of the same checker mostly multiplies the wrongness. It creates a beautiful row of confirmations over the same error.
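The arithmetic behind this is short. The sketch below assumes two idealized extremes: checkers that fail completely independently, and checkers that are copies sharing one blind spot. The 0.2 false-pass rate is an arbitrary illustration, not a measurement, and real pipelines sit somewhere between the two cases.

```python
def p_all_pass_independent(p_false_pass: float, k: int) -> float:
    """Chance that k independent checkers all approve a wrong answer."""
    return p_false_pass ** k


def p_all_pass_shared_blind_spot(p_false_pass: float, k: int) -> float:
    """Chance when every checker is a copy with the same blind spot.

    If one copy approves the wrong answer, all of them do, so k does not matter.
    """
    return p_false_pass


p = 0.2  # assumed per-checker false-pass rate, chosen only for illustration
for k in (1, 3, 5):
    print(k,
          round(p_all_pass_independent(p, k), 5),
          p_all_pass_shared_blind_spot(p, k))
```

With three independent checkers the wrong answer slips through about 0.8% of the time. With three copies of the same checker it still slips through 20% of the time, only now with three green checkmarks attached.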

This is where the old infrastructure metaphor starts breaking, and where another metaphor becomes useful.

The repair is not just retry.

The repair is decorrelation.

Different models. Different prompts. Different evaluation angles. Different representations of the same task. Different failure assumptions. Sometimes even different modalities of checking, where one component evaluates structure, another checks factual grounding, another verifies constraints, and another looks for contradiction.

That is not classical retry anymore.

That is closer to speciation.

You do not make the system robust by repeating the same organism. You make it robust by introducing enough variation that one blind spot does not become a colony-wide disease. The same way biological systems survive not because every unit is perfect, but because diversity changes how failure propagates.
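One way to picture decorrelation is a set of checks that look at the same output from different angles, so they cannot all fail for the same reason. The sketch below is a toy under stated assumptions: the three checks, the `priority` key, the allowed labels, and the urgency cue list are all hypothetical, and the "grounding" rule is deliberately crude. A real system would likely mix models, prompts, and deterministic rules.

```python
import json
from typing import Dict, Optional

ALLOWED = {"low", "medium", "high"}
URGENCY_CUES = ("outage", "urgent", "blocked")  # hypothetical cue list


def _parse(raw: str) -> Optional[dict]:
    """Return the parsed JSON object, or None if raw is not a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def check_structure(raw: str, source: str) -> bool:
    """Angle 1: is the output a JSON object with the expected key?"""
    data = _parse(raw)
    return data is not None and "priority" in data


def check_constraints(raw: str, source: str) -> bool:
    """Angle 2: is the value one of the allowed labels?"""
    data = _parse(raw)
    if data is None:
        return False
    value = data.get("priority")
    return isinstance(value, str) and value in ALLOWED


def check_grounding(raw: str, source: str) -> bool:
    """Angle 3: a 'high' label must be supported by an urgency cue in the source."""
    data = _parse(raw)
    if data is None:
        return False
    if data.get("priority") != "high":
        return True
    return any(cue in source.lower() for cue in URGENCY_CUES)


def evaluate(raw: str, source: str) -> Dict[str, bool]:
    checks = {
        "structure": check_structure,
        "constraints": check_constraints,
        "grounding": check_grounding,
    }
    return {name: check(raw, source) for name, check in checks.items()}


verdict = evaluate('{"priority": "high"}', "Customer asks about the invoice format.")
print(verdict)  # {'structure': True, 'constraints': True, 'grounding': False}
```

Two checks pass and one does not. Repeating the structure check three times would have produced three passes and no new information.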

This is the bridge I keep finding between old engineering and the older biological material I have been reading for years.

I spent almost three years reading Ant Encounters. Not because the book was impossible to read faster, but because every page kept connecting to something else. A small observation about ants was not just about ants anymore. It became a question about local decisions, global behavior, task allocation, distributed coordination, and how a system can stabilize without one central source of truth owning every decision.

Now I am reading about viruses as complex adaptive systems, and the same thing happens. One page becomes a week of thinking. Adaptation, persistence, mutation, failure, survival under pressure, local variation, global behavior. Suddenly this is not only biology. It becomes a way to think about AI systems that cannot rely on perfect deterministic components but still need to produce reliable behavior.

Compared with the rhythm of the AI world, this looks absurdly slow.

People publish five takes about five new tools before lunch, and I am still stuck on one biological analogy from a book that is not even about AI.

But maybe “stuck” is the wrong word.

Maybe this is not reading. Maybe this is compilation.

The page is not being consumed. It is being linked. It enters a context made of old work, unfinished ideas, technical scars, software architecture, biological curiosity, and frustration with shallow AI products. The result is not speed. The result is compression. A small input creates a large internal reorganization.

That is valuable, but it has a serious problem.

It is invisible.

And maybe this is the real fear behind the question “am I slow?” Not that I am actually slow. Not exactly. The fear is that while I am connecting dots, the field will move on without seeing any of it. The fear is that depth without visible output becomes indistinguishable from absence. The fear is that the AI world is so loud, so accelerated, and so addicted to fresh vocabulary that if I do not constantly produce something visible, I will disappear inside the buzz.

That fear is not irrational.

The field rewards fluency in the language of the week. If the word is “agents,” everyone builds agents. If the word is “reasoning,” everything becomes reasoning. If the word is “memory,” every cache becomes memory. If the word is “workflow,” every sequence of API calls becomes a platform.

I understand why this happens. Attention is scarce. Timing matters. If you arrive too late, the conversation has already moved somewhere else.

But there is a cost to always moving at that rhythm. You risk becoming synchronized with noise. You start optimizing for being current instead of being correct. You learn how to speak the new vocabulary faster than you understand the old problem underneath it. You become responsive, but not necessarily thoughtful.

And I do not want that.

At the same time, I cannot use depth as an excuse forever.

This is the uncomfortable part, and I should not escape it with a nice sentence.

Sometimes I am not slow. I am filtering. Sometimes I am not slow. I am connecting. Sometimes I am not slow. I am refusing to spend cognitive energy on hype that will disappear in two weeks.

But sometimes I am hiding behind depth.

Not theoretically. Not as a universal writer problem. Me, now, in this exact pattern.

I can feel it when a connection stays private longer than it should. I can feel it when a thought keeps becoming more complex in my head because publishing it would make it smaller, exposed, and easier to criticize. I can feel it when “I am still thinking about it” starts as discipline and slowly becomes shelter.

That is the failure mode I need to catch.

Because slow thinking is valuable only if it eventually becomes visible, testable, shareable, or executable. Otherwise it is just private complexity. It may feel profound internally, but from the outside it has no weight.

This does not mean every thought needs to become a polished theory. That would create another paralysis. But the intermediate steps need to leave traces. The note after reading one page matters. The connection between ant encounters and AI routing matters. The observation that a model check is a handshake with uncertainty hidden under the return type matters. The idea that decorrelated evaluators are closer to speciation than retry matters.


Not because each fragment is complete.

Because the fragments show the work.

That is probably the artifact I keep underestimating: the process of connecting old principles to new systems while the connection is still messy.

The current AI discourse is full of people saying “look at this new thing.” Maybe there is space for someone saying “look at this old principle returning in a strange form, and look carefully at where the analogy breaks.”

That is not slower.

It is a different rhythm.

A rhythm that does not compete well with hype in the short term, but maybe ages better.

And maybe that is what I actually want. I do not want to win the weekly AI vocabulary race. I do not want to rebuild my thinking every time the market chooses a new word. I do not want to become another person who confuses speed with direction.

I want to understand what remains true after the buzzword moves away.

That kind of work is slower by nature. You cannot connect biology, distributed systems, software architecture, and AI orchestration at the speed of a product launch thread. You cannot build a durable mental model by reacting to every notification. You cannot understand a field only by consuming its newest claims.

But you can disappear while doing deep work silently.

That is the warning I take seriously.

Not “you are too slow.”

More like:

Your slowness needs output.

Slow and invisible is dangerous. Slow and traceable is different. Slow and executable is different. Slow and published is different. Slow and connected to experiments, code, diagrams, arguments, failures, and public reasoning becomes a body of work.

So maybe the answer is yes, I am slow.

But I am not slow because I am lost.

I am slow because I am trying to understand the machinery instead of just repainting the dashboard. I am slow because every new AI idea seems to drag behind it a much older question. I am slow because I do not trust speed when speed is mostly social pressure. I am slow because I keep finding useful ghosts from older fields inside the newest buzzwords.

The risk is not being slow.

The risk is letting the work stay trapped inside my head until the world has no way to distinguish depth from silence.

So I probably do not need to chase more.

I need to expose more.

I need to turn the reading into notes, the notes into arguments, the arguments into experiments, and the experiments into artifacts. Not perfectly. Not only when the whole theory is clean. Earlier. Messier. More honestly.

Because maybe, in this AI world, reacting to every buzzword is not the same thing as adapting.

Maybe the harder part is still being able to think when the buzzword is gone.

Orchestrated Multi-Agent Safety & Test Oversight - AKA "’O MASTO"

marcosomma — Mon, 18 May 2026 21:01:11 +0000

I am building a small experiment inspired by Stripe Minions. Not related to OrKa. This is a different playground. But apparently I have a recurring problem: I do not trust agents enough to let them freely touch a codebase, and I do not trust humans enough to believe they will always review AI output properly when they are tired, rushed, or already late for another meeting.

So the question became simple. Can we automate small development tasks without pretending the coding agent is the adult in the room?

Because yes, AI can write code! We know that. Sometimes it writes useful code. Sometimes it writes code that looks clean, passes the first glance, and then you realize it quietly moved business logic into the wrong layer because it had “a better idea.” Classic junior developer energy, but with infinite confidence and no coffee breaks.

The interesting part of Stripe Minions, at least for me, is not that agents can open pull requests. The interesting part is the machinery around them. The task definition, the constraints, the review process, the checks, the fact that the agent is not just sitting there with a keyboard and divine permission to refactor your production system.

That is the part I want to explore!

In my experiment, GitHub access starts as read-only. The system can inspect the codebase, understand structure, look at existing patterns, and generate a candidate issue. But it cannot immediately modify anything. Before planning even starts, the task needs to pass a semantic gate: is it scoped, is it testable, is it clear enough, and is it safe enough to continue? Only after that does the workflow move into planning, architecture, execution, PR creation, review, and final merge validation.

I am calling this orchestration ’O MASTO.

In Neapolitan, ’o masto is the master craftsman. The person who looks at the work and decides if it is actually good enough. Not if it looks good in a demo. Not if the agent says it is done. Actually good enough.

In this experiment, it also stands for Orchestrated Multi-Agent Safety & Test Oversight. Yes, the acronym is a bit forced. No, I do not care. It makes me laugh, and naming things is half of software engineering anyway.

The idea is that ’O MASTO is the layer that does not trust the agent. It checks the task, the plan, the implementation, the PR, the tests, the regression risk, and the final merge conditions. It is not there to be impressed. It is there to say “no, this is not good enough, go back.”

That is the core idea I keep coming back to. The executor is not the boss. The reviewer is not the boss. The LLM is definitely not the boss. The gate is the boss!
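The semantic gate described earlier asks four questions before planning starts: is the task scoped, testable, clear, and safe? A minimal sketch of that shape follows. It is not the implementation used in the experiment. The `CandidateTask` fields, the thresholds, and the `migrations/` rule are assumptions picked for illustration, and a real gate would probably combine rules like these with model-based judgment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateTask:
    title: str
    files_touched: List[str]
    acceptance_test: str  # how we would know the task is done
    description: str


def semantic_gate(task: CandidateTask, max_files: int = 3) -> List[str]:
    """Return the reasons a task may not proceed to planning (empty list = pass)."""
    reasons = []
    if len(task.files_touched) > max_files:
        reasons.append("not scoped: touches too many files")
    if not task.acceptance_test.strip():
        reasons.append("not testable: no acceptance test")
    if len(task.description.split()) < 10:
        reasons.append("not clear: description too short")
    if any(path.startswith("migrations/") for path in task.files_touched):
        reasons.append("not safe: touches migrations")
    return reasons


good_task = CandidateTask(
    title="Add null check to invoice parser",
    files_touched=["billing/parser.py", "tests/test_parser.py"],
    acceptance_test="test_parser_handles_missing_total passes",
    description="Return a clear error instead of crashing when the invoice total field is missing.",
)
vague_task = CandidateTask(
    title="Refactor stuff",
    files_touched=["a.py", "b.py", "c.py", "migrations/0002.py"],
    acceptance_test="",
    description="Clean up the code.",
)

print(semantic_gate(good_task))   # []
print(semantic_gate(vague_task))  # four reasons, one per failed question
```

The gate returns reasons rather than a bare boolean, so a rejected task goes back with something actionable attached.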

I think this is where AI coding workflows need to go. Not toward bigger chat windows where we ask the model to “please be careful.” Toward systems that assume the model will be wrong sometimes and are designed to catch that before the damage reaches main.

AI will not remove engineering discipline. It will expose who actually had it.

If a project has no tests, no review culture, no stable patterns, no definition of done, and no clear ownership, an AI coding agent will not magically fix it. It will just produce chaos faster, with better formatting.

My bet is that the next serious layer of AI development tooling will be trust infrastructure. Not just generation. Validation. Rejection. Retry. Traceability. Merge control. Basically "old school" software engineering.

The Real Token Economy Is Not About Spending Less. It Is About Thinking Smaller.

marcosomma — Sun, 26 Apr 2026 22:46:01 +0000

I saw a video today that made me laugh, then made me a bit worried.

It was one of those jokes that is not really a joke because you can already see some company doing it six months from now. A manager was basically complaining because an employee was not spending enough AI tokens. Not enough tokens. As if tokens were steps on a fitness tracker.

"You only burned 2,000 tokens today, Susan. Are you even working?"

It sounds absurd, but we are not that far from it. Companies are already starting to measure AI adoption through number of prompts, number of tool calls, input tokens, output tokens, cost per user, cost per team, cost per workflow. And to be clear, I do not think this is automatically wrong. Measuring token usage makes sense. Tokens are cost. Tokens are latency. Tokens are context. They are also a trace of how people and systems are using AI.

The problem starts when we confuse the metric with the objective. We did this with hours worked. We did this with tickets closed. We did this with meetings attended. We did this with leads, where 1,000 unqualified leads looked better than 10 serious conversations because the spreadsheet was having a great day and nobody wanted to ruin the mood with reality.

Now we risk doing the same with tokens.

More tokens do not mean better work. Fewer tokens do not mean smarter work. The interesting signal is not the raw number. The interesting signal is the relationship between what you put into the model, what you ask it to do, and what comes out. That is where I think the real token economy starts. Not as a cost-saving obsession, but as an architectural signal.

Tokens are not just money

The first way people talk about tokens is cost, and that is understandable. If you use hosted LLM APIs, tokens map quite directly to money. Input tokens cost something. Output tokens cost something. Larger models cost more. Long contexts cost more. Retries cost more. Bad prompts cost more. Bad architecture costs a lot more, but usually in a way that arrives later and looks like a reliability problem.

So the first instinct is to optimize token consumption. Compress prompts. Summarize context. Pick cheaper models. Cache responses. Reduce unnecessary output. All of that is useful, but I think it is only the shallow layer of the problem.

The more interesting question is not "how many tokens did this task consume?" The more interesting question is "what cognitive operation did those tokens represent?"

Because input tokens and output tokens are not the same thing. Input tokens usually buy context. They are the material you ask the model to look at. Output tokens usually buy generation, explanation, structure, synthesis, or action. If I send 10,000 input tokens to a model and get back 10 output tokens, that could be terrible. It could also be exactly right.

If the task is to read a long error log and return whether the failure is caused by authentication, a tiny output may be valid. If the task is to classify a product review as positive, neutral, or negative, a small answer is not a failure. It is the point. If the task is to route a bug report to the correct engineering queue, I do not need a novel. I need the right route.

So no, high input and low output is not automatically bad. But it is a signal. And I think that signal deserves a lot more attention than it currently gets.

Balance does not mean symmetry

When I talk about token balance, I do not mean that input tokens and output tokens should be equal. That would be a very silly metric, and we already have enough silly metrics trying to cosplay as management science.

By balance, I mean the relationship between the size of the input, the size of the output, and the value of the decision produced. A large input with a tiny output usually means the model is doing some kind of compression, classification, extraction, routing, filtering, moderation, scoring, validation, or decision making. A small input with a large output usually means the model is doing generation, expansion, explanation, drafting, or ideation. A large input with a large output usually means synthesis, transformation, summarization, comparison, or multi-step reasoning. A small input with a small output is usually a narrow atomic task.

None of these patterns are good or bad by themselves. They tell you something about the shape of the work. And sometimes the shape of the work is screaming.
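The four shapes can be written down as a tiny diagnostic. This is a sketch under one loud assumption: the 2,000-token threshold for "large" is arbitrary and would need tuning per workload. The function labels a call. It does not judge it.

```python
def work_shape(input_tokens: int, output_tokens: int, large: int = 2000) -> str:
    """Label a model call by the shape of its token usage.

    `large` is an assumed cutoff for illustration, not a recommended value.
    """
    big_in = input_tokens >= large
    big_out = output_tokens >= large
    if big_in and not big_out:
        return "compression or decision (classification, extraction, routing)"
    if not big_in and big_out:
        return "generation or expansion (drafting, explanation, ideation)"
    if big_in and big_out:
        return "synthesis or transformation (summarization, comparison)"
    return "narrow atomic task"


print(work_shape(10_000, 10))   # compression or decision (...)
print(work_shape(50, 4_000))    # generation or expansion (...)
print(work_shape(8_000, 3_000)) # synthesis or transformation (...)
print(work_shape(120, 5))       # narrow atomic task
```

Logged per call, a label like this makes it easier to spot the workflows where a huge input keeps producing a one-word decision.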

Imagine you send a giant prompt containing a full meeting transcript, a product description, usage logs, a bug report, five examples, a JSON schema, tone guidelines, safety instructions, and a final line saying "be concise" because apparently we enjoy irony. Then you ask the model to return this:

{
  "priority": "high"
}

Maybe that is fine. Maybe the classification really required all of that context. But maybe you just built a cognitive washing machine to clean one spoon.

The point is not that the token ratio is wrong. The point is that the ratio invites questions. Did the task need all of that context? Could the context have been retrieved more narrowly? Could the classification have been separated from the extraction? Could a smaller model do part of the work? Could a deterministic rule do part of it? Could the final output be validated separately instead of trusting one giant model call?

That is where token metrics become useful. Not as a scoreboard. As a diagnostic tool.

The real problem is overloaded cognition

A lot of AI workflows are not expensive because the model is expensive. They are expensive because the task design is confused. We ask one model call to do too many things at once, then we act surprised when the model behaves like a very intelligent intern who received eight contradictory Jira tickets in one message.

Read this long input. Understand the domain. Extract twenty fields. Normalize them. Infer missing values. Respect the schema. Apply business rules. Avoid hallucinations. Explain your decision. Be concise. Be deterministic. Also, please do it in one call because we saw a demo once and now we think architecture is a prompt template.

This is where things become fragile. One big prompt. One big model. One fragile JSON output. One retry loop when it fails. One annoyed engineer staring at a malformed comma at 1:12 AM wondering why they studied data structures.

The problem is not only cost. The problem is that the reasoning surface is too large. Every additional instruction increases the model's degrees of freedom. Every unrelated piece of context adds noise. Every extra output field increases the chance of format drift. Every hidden dependency between fields makes validation harder. And when the output fails, you often do not know why.

Was the context too long? Was the instruction ambiguous? Was the schema too complex? Was the task logically overloaded? Was the model too weak? Was the model too creative? Was Mercury in retrograde? At some point, debugging a giant prompt starts to feel like debugging a dream.

This is why I think the unit of optimization should not be the prompt. The unit of optimization should be the cognitive task.

Think smaller, not just cheaper

When people hear "token economy", they often think about saving money. I think that is incomplete. The better version is this: design AI workflows so each model call has the smallest reasonable cognitive surface.

Not the smallest prompt. Not the cheapest model. The smallest cognitive surface.

A task has a cognitive surface when it asks the model to consider a certain amount of context, make a certain type of judgment, and produce a certain kind of output. A wide cognitive surface is something like this: read a conversation, infer the user's emotional state, detect all action items, classify the sales opportunity, extract objections, score urgency, summarize the call, generate a follow-up email, and return a perfect JSON object with 28 fields.

That is not one task. That is a small village.

A narrower cognitive task is different. Given this segment of a product feedback thread, identify whether the user mentions pricing as a blocker. Return true or false. Or extract only the next meeting date from this text and return null if absent. Or given these three already extracted signals, choose the priority level from low, medium, or high.

Those tasks have narrower inputs and narrower outputs. They are easier to validate. They are easier to retry. They can often run on smaller models. Some can be replaced by deterministic code. Most importantly, they reduce ambiguity.

This is the part that matters. The best token optimization is not always compression. Sometimes the best token optimization is decomposition.
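Two of the narrow tasks above are small enough that deterministic code may cover them. The sketch assumes dates appear in ISO `YYYY-MM-DD` form and that the first valid one is the meeting date. The priority rule and the three signal names are hypothetical. Neither function is a claim about how such tasks must be built, only a reminder that some of them do not need a model call at all.

```python
import re
from datetime import date
from typing import Optional

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_meeting_date(text: str) -> Optional[date]:
    """Return the first valid ISO date in the text, or None if absent."""
    for match in ISO_DATE.finditer(text):
        year, month, day = map(int, match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            continue  # e.g. 2026-13-40 looks like a date but is not one
    return None


def choose_priority(pricing_blocker: bool, churn_risk: bool,
                    deadline_mentioned: bool) -> str:
    """Map three already extracted signals to low, medium, or high.

    The counting rule is an assumption made for this example.
    """
    score = sum([pricing_blocker, churn_risk, deadline_mentioned])
    if score >= 2:
        return "high"
    return "medium" if score == 1 else "low"


print(extract_meeting_date("Let us meet on 2026-05-04 at noon."))  # 2026-05-04
print(extract_meeting_date("No date was agreed."))                 # None
print(choose_priority(True, True, False))                          # high
```

Each function has one input, one output, and a validation rule that fits in a single assertion.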

The 20-field JSON problem

Let us take a simple example. You have a large input document and you need a structured output with 20 values. The obvious modern AI approach is to send the full document to a model and ask it to extract everything in one JSON object. Add a schema, add "do not hallucinate", add "use null when unknown", maybe add three examples, and hope the model behaves.

Sometimes this works. Sometimes it works very well in the demo. Then production arrives, wearing boots.

The model misses a field. It invents a value. It mixes two fields. It returns invalid JSON. It follows the schema but puts the wrong value in the right place, which is worse because it looks correct. It explains itself inside a field because apparently JSON needed feelings.

So you add more instructions. Then stricter schema language. Then validation. Then retry. Then a stronger model. Then a more expensive model. Then someone says, "Maybe we should fine-tune it." And now your simple extraction pipeline has become a small national infrastructure project.

A different approach is to ask a boring but useful question: are these 20 values actually one cognitive task?

Maybe not. Maybe five fields are direct extraction. Maybe three require classification. Maybe four depend on dates. Maybe two require numerical normalization. Maybe six are only relevant if a previous condition is true. In that case, one big prompt is not simpler. It is only hiding the complexity inside the model call.

You may get a better system by clustering the fields by semantic dependency. For example, direct identifiers can be one batch. Dates and temporal constraints can be another. Risk indicators can be another. Obligations and responsible parties can be another. The final normalized summary can be built only after the previous signals exist.

Each batch can have a smaller prompt, a smaller schema, and a narrower validation rule. Some batches may not need an LLM. Some can use regex, parsers, lookup tables, embeddings, or deterministic checks. Some can use a small local model. Only the genuinely difficult parts need the expensive model.
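A minimal sketch of that batching idea, under stated assumptions: the `Batch` type, the field names, and `fake_risk_model` are all hypothetical and stand in for whatever schema and model client you actually use. The point is only the shape: each batch declares its fields, some batches are plain code, and a failure is attributed to one batch instead of the whole object.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Batch:
    name: str
    fields: tuple                    # fields this batch is responsible for
    handler: Callable[[str], dict]   # text -> {field: value or None}
    uses_llm: bool


def extract_dates(text: str) -> dict:
    """Deterministic batch: pull ISO dates with a regex, no model call."""
    found = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)
    return {
        "start_date": found[0] if len(found) > 0 else None,
        "end_date": found[1] if len(found) > 1 else None,
    }


def fake_risk_model(text: str) -> dict:
    """Stand-in for a real LLM call (hypothetical), kept trivial on purpose."""
    return {"risk_level": "high" if "penalty" in text else "low"}


def run_batches(text: str, batches: list) -> tuple:
    """Run each batch, keep only its declared fields, record failures per batch."""
    result, failed = {}, []
    for batch in batches:
        values = batch.handler(text)
        missing = [f for f in batch.fields if f not in values]
        if missing:
            failed.append(batch.name)  # retry or inspect this batch only
            continue
        result.update({f: values[f] for f in batch.fields})
    return result, failed


batches = [
    Batch("dates", ("start_date", "end_date"), extract_dates, uses_llm=False),
    Batch("risk", ("risk_level",), fake_risk_model, uses_llm=True),
]
doc = "Contract runs 2026-01-01 to 2026-12-31. Late delivery carries a penalty."
fields, failed = run_batches(doc, batches)
```

Because `failed` names the batch, a retry can target only the part that broke, and swapping the handler for one batch does not touch the others.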

This is where the cost savings come from, but cost is only one part of the win. You also get better observability. If the final output is wrong, you can inspect which subtask failed. You can measure field-level accuracy. You can retry only the failing part. You can swap models for one stage without touching the rest. You can cache intermediate outputs. You can add deterministic validation at the boundary.

That is a real token economy. Not "use fewer tokens". Spend tokens where cognition is actually needed.

Smaller prompts reduce variance

I want to be careful with the word deterministic. LLMs are not truly deterministic systems in the classical engineering sense, even when you reduce temperature and constrain output. They are probabilistic systems. But workflow design can make their behavior more stable, more reproducible, and more controllable.

Smaller prompts with narrower objectives usually reduce the degrees of freedom of the model. If the model has one job, a small output space, and a strict schema, there are fewer ways to fail. If the model has twenty jobs, a large input, competing instructions, implicit dependencies, and a complex schema, you should not be surprised when it occasionally decides to express itself like a haunted spreadsheet.

This is why task decomposition can improve consistency. Not because small calls magically make the model deterministic, but because small calls make the system around the model easier to control. The output space is narrower. The validation is simpler. The retry logic is cheaper. The failure modes are easier to classify. The model choice becomes more flexible. The prompts become easier to test.
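To make "narrow output space, simple validation, cheap retry" concrete, here is a small sketch for a true-or-false task like the pricing-blocker example. `call_model` is a hypothetical callable for whatever client you use; nothing here depends on a specific API.

```python
from typing import Callable, Optional


def parse_bool(raw: str) -> Optional[bool]:
    """Accept exactly 'true' or 'false' (case-insensitive); anything else is invalid."""
    token = raw.strip().lower()
    if token == "true":
        return True
    if token == "false":
        return False
    return None


def ask_narrow(call_model: Callable[[str], str], prompt: str, max_attempts: int = 3):
    """Call the model until the answer parses; return (value, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        value = parse_bool(call_model(prompt))
        if value is not None:
            return value, attempt
    raise ValueError(f"no valid answer after {max_attempts} attempts")
```

The validator is three lines because the output space has two values. The same loop around a 20-field JSON object needs a schema, partial-failure handling, and a much more expensive retry.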

And the orchestration becomes explicit.

That last point matters a lot. When everything happens inside one prompt, the process is invisible. When you split the work into stages, the process becomes inspectable. This is the difference between hoping the model thinks correctly and designing a system where each step can be observed.

Where OrKa fits into this

This is one of the reasons I have been building OrKa, an orchestration framework for AI agents and reasoning workflows. The point of OrKa is not "use more agents because agents are cool". Honestly, if adding agents makes your system less understandable, congratulations, you have invented distributed confusion.

The point is different. Make cognitive work explicit. Define the flow. Split reasoning into smaller units. Route tasks. Log execution. Validate outputs. Keep memory and context under control. Make the system inspectable instead of praying over a large prompt.

In this view, an LLM is not the application. It is one component inside a system. Sometimes the LLM extracts. Sometimes it classifies. Sometimes it rewrites. Sometimes it evaluates. Sometimes it should not be called at all. The orchestration layer decides how work moves between these pieces.

That is where token economy becomes architecture. You are no longer asking only how to reduce a prompt by 20 percent. You are asking which cognitive step actually needs this context.

That question changes everything. Maybe the first model call only needs the user message. Maybe the second needs the relevant log snippet. Maybe the third needs only three extracted fields. Maybe the final formatter needs no model at all. If you send the full context to every step, you are not designing an AI system. You are photocopying the universe and asking a model to find the invoice number.
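One way to enforce that question in code is to make every step declare what it reads. The sketch below is an illustration only, not OrKa's API: `Step`, `needs`, and `produces` are hypothetical names, and the step bodies are plain functions where a real system might place a model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    needs: tuple                    # keys of shared state this step may read
    produces: str                   # key this step writes
    run: Callable[[dict], object]


def run_pipeline(state: dict, steps: list) -> dict:
    """Run steps in order, passing each one only the keys it declared."""
    state = dict(state)
    for step in steps:
        # A KeyError here means a dependency is missing, which is worth knowing early.
        view = {key: state[key] for key in step.needs}
        state[step.produces] = step.run(view)
    return state


steps = [
    Step("classify", ("user_message",), "intent",
         lambda v: "billing" if "invoice" in v["user_message"] else "other"),
    Step("find_error", ("log_snippet",), "error_line",
         lambda v: next((l for l in v["log_snippet"].splitlines() if "ERROR" in l), None)),
    Step("format", ("intent", "error_line"), "reply",
         lambda v: f"[{v['intent']}] {v['error_line']}"),
]
state = {
    "user_message": "My invoice failed",
    "log_snippet": "INFO ok\nERROR timeout",
    "full_history": "never passed to any step above",
}
out = run_pipeline(state, steps)
```

No step above ever sees `full_history`, and the final formatter is ordinary string formatting with no model involved.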

It may work. It is not a strategy.

Token metrics should trigger questions

So how should teams use token metrics? Not as productivity surveillance. Not as a way to shame people for using too many or too few tokens. Not as a leaderboard where the person with the most prompts wins some cursed office trophy.

Token metrics should trigger engineering questions.

When input tokens are very high and output tokens are very low, ask whether the task is intentionally compressive or accidentally overloaded. When output tokens are very high, ask whether the model is generating useful structure or just producing expensive fog. When the same context is repeatedly sent across multiple calls, ask whether retrieval, caching, or state passing could reduce duplication. When a large model is used for simple extraction, ask whether a smaller model or deterministic rule would work. When retries consume a lot of tokens, ask whether the schema, validation, or task boundaries are wrong.
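Those checks are simple enough to write down. The sketch below maps one call's token shape to the questions worth asking; every threshold is an illustrative assumption, not a recommended value, and `CallStats` is a hypothetical record type.

```python
from dataclasses import dataclass


@dataclass
class CallStats:
    step: str
    input_tokens: int
    output_tokens: int
    retries: int = 0
    large_model: bool = False
    simple_task: bool = False


def questions_for(stats: CallStats, high_input=8000, high_output=2000, ratio=50) -> list:
    """Return the engineering questions a call's token shape should trigger."""
    out = []
    lopsided = stats.input_tokens >= ratio * max(stats.output_tokens, 1)
    if stats.input_tokens >= high_input and lopsided:
        out.append("compressive by design, or overloaded by accident?")
    if stats.output_tokens >= high_output:
        out.append("useful structure, or expensive fog?")
    if stats.large_model and stats.simple_task:
        out.append("would a smaller model or a deterministic rule do?")
    if stats.retries >= 2:
        out.append("are the schema, validation, or task boundaries wrong?")
    return out
```

The function returns questions, not verdicts. A flagged call may be perfectly justified; the output only tells you where to look.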

This does not mean splitting tasks is automatically better. If you send the same 10,000 token input twenty times to extract twenty fields, you may have made the system more expensive and slower. You have not built architecture. You have built a very complicated way to duplicate context.

The win comes when decomposition is paired with context narrowing. Extract the relevant segment once. Reuse intermediate state. Cluster fields that share dependencies. Route only the necessary context. Validate locally. Use smaller models where possible. Stop calling the model when code can do the job.
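The arithmetic behind that warning is worth seeing once. The sketch below compares input tokens for the naive split against a narrowed one; the 200-token prompt, the 800-token segments, and the five clusters are made-up numbers chosen only to illustrate the difference.

```python
def naive_input_tokens(doc_tokens: int, calls: int, prompt_tokens: int) -> int:
    """Every call re-sends the whole document plus its own instructions."""
    return calls * (doc_tokens + prompt_tokens)


def narrowed_input_tokens(doc_tokens: int, segment_tokens: int,
                          calls: int, prompt_tokens: int) -> int:
    """One pass over the full document to select segments, then each call reads only its segment."""
    selection_pass = doc_tokens + prompt_tokens
    return selection_pass + calls * (segment_tokens + prompt_tokens)


# 20 fields, one call each, full 10,000-token document every time
naive = naive_input_tokens(doc_tokens=10_000, calls=20, prompt_tokens=200)

# 20 fields clustered into 5 calls, each reading an 800-token segment
narrowed = narrowed_input_tokens(doc_tokens=10_000, segment_tokens=800,
                                 calls=5, prompt_tokens=200)
```

Under these assumptions the naive split costs 204,000 input tokens and the narrowed one 15,200. If the selection pass itself needs a model, its output tokens belong in the comparison too.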

This is not anti-LLM. It is pro-system.

A simple mental model

Here is the mental model I keep coming back to. Input tokens are attention budget. Output tokens are commitment surface.

The more input you provide, the more the model has to attend to. The more output you request, the more opportunities the model has to drift. A workflow becomes more stable when the attention budget and the commitment surface are aligned with the actual cognitive task.

If the model needs to classify one thing, do not ask it to also summarize, extract, explain, normalize, and format a complex object. If the model needs to generate a long answer, do not overload it with irrelevant context that only increases noise. If the model needs to extract structured fields, do not assume all fields belong in the same call. If the model needs to make a decision, make the decision boundary explicit.

The goal is not minimal tokens. The goal is minimal unnecessary cognition.

That distinction is important. Some tasks deserve many tokens. A long research synthesis may need a lot of context. A technical incident summary may need careful source retention. A product comparison may need long input and long output. A multi-document comparison may be legitimately expensive.

The problem is not spending tokens. The problem is spending tokens without knowing what they are buying.

This is also a model selection problem

Once you split cognitive tasks, model selection becomes much more interesting. In a one-prompt architecture, you usually choose the strongest model you can afford because the task is messy. The model has to handle everything. It has to read long context, reason, extract, format, validate, and recover from ambiguity.

But if you split the workflow, you can choose models per cognitive step. A small model can do simple classification. A local model can extract obvious fields. A deterministic parser can normalize dates. A rules engine can validate constraints. A stronger model can handle the genuinely ambiguous reasoning.

This is where the economics change. Not because you begged the prompt to be shorter, but because you changed the shape of the work. The expensive model becomes a specialist instead of a landfill.

And yes, I know "landfill" sounds harsh. But many AI systems today are exactly that. They throw all context into one place and hope the biggest model will recycle it into something useful. This works surprisingly often, which is the dangerous part. It works enough to ship a demo. It fails enough to punish you in production.

Token economy as observability

A mature AI system should not only log the final response. It should log the token shape of the workflow.

Which step consumed the most input? Which step produced the most output? Which step retried the most? Which step had the most schema failures? Which step required the strongest model? Which step could be cached? Which step could be replaced by code? Which step actually improved the final decision?
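Answering those questions requires per-step records, not one total per request. A minimal sketch, assuming you can hook token counts at each call; `TokenTrace` and its field names are hypothetical and not part of any framework mentioned here.

```python
from collections import defaultdict


class TokenTrace:
    """Accumulate per-step token usage so a workflow can be ranked by where tokens go."""

    def __init__(self):
        self._steps = defaultdict(lambda: {
            "calls": 0, "input": 0, "output": 0, "retries": 0, "schema_failures": 0,
        })

    def record(self, step, input_tokens, output_tokens, retry=False, schema_ok=True):
        row = self._steps[step]
        row["calls"] += 1
        row["input"] += input_tokens
        row["output"] += output_tokens
        row["retries"] += int(retry)
        row["schema_failures"] += int(not schema_ok)

    def top(self, metric):
        """Return (step, total) for the step with the highest total of `metric`.

        Raises ValueError if nothing has been recorded yet.
        """
        return max(((s, r[metric]) for s, r in self._steps.items()), key=lambda p: p[1])


trace = TokenTrace()
trace.record("retrieve", 9000, 50)
trace.record("extract", 1200, 300, retry=True, schema_ok=False)
trace.record("extract", 1200, 280)
```

With this in place, `trace.top("input")` and `trace.top("schema_failures")` answer two of the questions above directly. Whether a step "actually improved the final decision" still needs an evaluation, not a counter.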

This is not accounting. This is observability.

You are not only tracking spend. You are tracking cognitive pressure inside the system. A sudden increase in input tokens may mean your retrieval is bringing too much context. A sudden increase in output tokens may mean the model started explaining instead of structuring. A high retry cost may mean your schema is too complex or your prompt is ambiguous. A high token cost on a low-value decision may mean the workflow needs decomposition. A low token cost with poor quality may mean you compressed away necessary context.

Again, the metric is not the answer. The metric is the signal. The engineer still needs judgment, which is very inconvenient. We were promised automation and somehow we still need thinking. Rude.

The wrong future

The wrong future is easy to imagine. Teams get AI dashboards. Managers see token usage per employee. People are encouraged to "use AI more". Token consumption becomes proof of adoption. Employees learn to generate more prompts because the dashboard rewards activity. Everyone looks productive, costs go up, and quality does not.

Then leadership announces an AI efficiency initiative. Now everyone must reduce token usage. People use smaller prompts. Quality drops. Nobody knows why. Another dashboard is created. A consultant appears. The circle of life continues.

This is what happens when the metric becomes the goal. Token usage by itself tells you almost nothing about quality. A great engineer may use fewer tokens because they decomposed the problem properly. Another great engineer may use more tokens because the task genuinely required context. A bad workflow may use few tokens and produce garbage. A good workflow may use many tokens and produce a high-value decision.

So measuring tokens is not wrong. Judging work by raw token volume is wrong.

The better future

The better future is more boring, which is usually a good sign in engineering. Teams treat token metrics as workflow diagnostics. They look at input-output patterns. They identify overloaded prompts. They split tasks where it makes sense. They route context more carefully. They use smaller models for smaller cognitive jobs. They validate structured outputs separately. They measure retries, drift, and failure modes.

They do not ask "how much AI did you use?" They ask "where did the AI actually add decision value?"

That is the mindset shift. Tokens are not just a bill. Tokens are a trace of cognitive architecture. They show where a system is bloated. They show where context is duplicated. They show where outputs are too ambitious. They show where models are being used as glue because nobody wanted to design the pipeline. And yes, sometimes they show that the expensive model was actually justified.

That is fine. The goal is not to make everything cheap. The goal is to make the system honest.

Final thought

I think the next stage of AI engineering will not be about who writes the cleverest prompt. It will be about who designs the clearest cognitive pipeline.

The prompt is not the unit of architecture. The cognitive task is.

Token balance matters because it gives us a way to inspect that task. Not perfectly. Not automatically. Not as a KPI to punish or reward people. But as a signal that says: maybe this workflow is overloaded, maybe this context is too broad, maybe this output is trying to do too much, maybe this model is stronger than necessary, maybe this task should be split, maybe this step should not use an LLM at all.

That is where the real token economy lives. Not in spending fewer tokens, but in spending attention where attention is needed.

If we do that well, the benefits go beyond cost. Lower latency. Smaller models. Cleaner validation. Less format drift. More stable outputs. More inspectable workflows. Systems that are closer to engineering and less close to whispering wishes into a very expensive autocomplete machine.

Which, to be fair, is still fun.

But maybe not the future we should build production systems on.

Claude! Stop Burning Tokens on Your Agent's Tool Output!

marcosomma — Tue, 21 Apr 2026 09:39:42 +0000

A Two-Stage Curator That Pays for Itself

I watched Claude Code feed 108,894 bytes of seq 1 20000 back into its own context window. That output contained 20,000 integers.
No errors. No signal. No insight. Just counting.

And yet the system still had to tokenize it, send it back to the model, and bill for it. This is not an edge case. It is the default failure mode of agent tooling.

Tools produce output. The output goes back to the model. You pay for it. Logs, test runs, ps listings, git history, build spam, progress bars, boilerplate, decorative separators, repeated warnings, repeated success lines, repeated everything. A depressing amount of it is useless.

A lot of agent users are quietly paying premium model rates to process terminal confetti. That is the real problem!

My first fix was the obvious one. I added a PreToolUse hook and pushed large Bash output through a cheaper model before it reached Opus.

That worked, technically. Then I noticed I was still being stupid, just in a more optimized way.

On seq 1 20000, I was paying Haiku to read 20,000 integers and tell me they were 20,000 sequential integers.

Yes, that is cheaper than letting Opus read them.

No, that is not a good design.

If a 40-line awk script can identify the pattern for free, paying any model to summarize it is already waste. So the architecture changed. Not “small model before big model.” That idea sounds clever, but by itself it is just cost reshuffling.

The real pattern is simpler and better: extract free signal first, and only pay for a model when deterministic tools run out of leverage.

That led to a two-stage curator.

Stage 1 is deterministic cleanup. It strips ANSI escape sequences, removes carriage-return junk from progress bars, collapses consecutive duplicate lines, and compresses monotonic integer runs. It is effectively free.

Stage 2 is LLM extraction, and it runs only if stage 1 still leaves too much output. It is the only stage that spends tokens, so it should fire rarely.

That distinction matters! Because once you actually measure it, a lot of tool output turns out to be compressible by embarrassingly simple logic.

The benchmark

Here is the benchmark across five scenarios, using the pricing assumptions from the benchmark runner: Opus input at $15 per million tokens, Haiku input at $1 per million, Haiku output at $5 per million, and a rough estimate of 4 bytes per token.

Scenario	Raw bytes	Stage 1	Final	LLM?	Tokens saved	Haiku cost	Net savings
`seq 1 20000`	108,894	37	103	no	27,197	$0.000	+$0.408
5,000 repeated log lines	145,000	38	104	no	36,224	$0.000	+$0.543
ANSI + progress-bar spam	54,000	27	92	no	13,477	$0.000	+$0.202
`ps auxww` (unique lines)	213,230	213,230	1,008	yes	53,055	$0.055	+$0.741
`echo hello world`	12	12	12	no	0	$0.000	$0.000
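The savings columns can be recomputed from the byte counts and the stated prices. The sketch below is a reconstruction, not the benchmark runner itself: flooring the byte difference and treating the final byte count as the Haiku output size are my assumptions, chosen because they reproduce the rows above.

```python
OPUS_IN = 15.0        # $ per million input tokens
HAIKU_IN = 1.0        # $ per million input tokens
HAIKU_OUT = 5.0       # $ per million output tokens
BYTES_PER_TOKEN = 4   # rough estimate used in the benchmark


def scenario_row(raw_bytes, final_bytes, llm_input_bytes=0):
    """Return (tokens_saved, haiku_cost, net_savings) for one scenario.

    llm_input_bytes is what stage 2 read; 0 means the LLM was never called.
    """
    tokens_saved = (raw_bytes - final_bytes) // BYTES_PER_TOKEN
    haiku_cost = 0.0
    if llm_input_bytes:
        haiku_cost = (llm_input_bytes / BYTES_PER_TOKEN * HAIKU_IN
                      + final_bytes / BYTES_PER_TOKEN * HAIKU_OUT) / 1_000_000
    net = tokens_saved * OPUS_IN / 1_000_000 - haiku_cost
    return tokens_saved, round(haiku_cost, 3), round(net, 3)


seq_row = scenario_row(108_894, 103)
ps_row = scenario_row(213_230, 1_008, llm_input_bytes=213_230)
```

Almost all of the Haiku cost in the `ps auxww` row comes from reading the input, not from writing the summary, which is why shrinking the input before stage 2 matters more than shortening the summary.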

The pattern is blunt.

Three of the four large-output cases were handled for free.

seq collapsed from 108,894 bytes to 37.
Repeated log spam dropped from 145,000 to 38.
ANSI and progress-bar noise fell from 54,000 to 27.

No LLM call was needed in any of those cases.

The only scenario that needed stage 2 was ps auxww, which is exactly what you would want. That output is genuinely varied. There is not much for awk to compress. That is the moment when paying a smaller model to extract the useful facts is justified.

Small output remained untouched, which is also correct. If the command only produced hello world, the system paid nothing and moved on.

This is the whole point.

A cheap LLM is not the first line of defense.

Deterministic cleanup is.

The LLM should be the escalation path, not the reflex.

Stage 1: deterministic cleanup

This is the part that does the real work more often than people expect.

#!/usr/bin/awk -f

function flush_int_run() {
    if (int_count >= 3) {
        printf "[%d sequential integers %s..%s]\n", int_count, int_start, int_end
    } else if (int_count > 0) {
        for (i = int_start; i <= int_end; i++) print i
    }
    int_count = 0
}

function flush_dupe() {
    if (dupe_count > 1) {
        printf "%s [×%d]\n", dupe_line, dupe_count
    } else if (dupe_count == 1) {
        print dupe_line
    }
    dupe_count = 0
}

{
    gsub(/\033\[[0-9;]*[a-zA-Z]/, "", $0)
    gsub(/\r/, "", $0)

    if ($0 ~ /^-?[0-9]+$/) {
        n = $0 + 0
        if (int_count > 0 && n == int_end + 1) {
            int_end = n
            int_count++
            next
        }
        flush_int_run()
        flush_dupe()
        int_start = n
        int_end = n
        int_count = 1
        next
    }

    flush_int_run()

    if (dupe_count > 0 && $0 == dupe_line) {
        dupe_count++
    } else {
        flush_dupe()
        dupe_line = $0
        dupe_count = 1
    }
}

END {
    flush_int_run()
    flush_dupe()
}

There is nothing magical here.

It strips terminal paint.
It collapses repeated lines.
It compresses obvious integer sequences.

That is enough to destroy huge amounts of waste.

This is a useful reminder for AI tooling in general. A lot of expensive “reasoning” problems are not reasoning problems. They are preprocessing failures.

Stage 2: only escalate if stage 1 failed to shrink enough

The wrapper runs the command, checks the raw size, and passes small output through untouched.

If the raw output is large, it runs the deterministic cleaner.

If the cleaned output is now small enough, it returns that cleaned output and stops.

Only if the cleaned output is still large does it call Haiku.

#!/usr/bin/env bash
set -o pipefail

RAW_THRESHOLD="${CLAUDE_BASH_SUMMARIZE_THRESHOLD:-8000}"
LLM_THRESHOLD="${CLAUDE_BASH_LLM_THRESHOLD:-8000}"
MODEL="${CLAUDE_BASH_SUMMARIZE_MODEL:-claude-haiku-4-5}"

cmd="$1"
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
CLEAN_AWK="$SCRIPT_DIR/deterministic-clean.awk"

raw=$(mktemp); cleaned=$(mktemp)
trap 'rm -f "$raw" "$cleaned"' EXIT

bash -c "$cmd" >"$raw" 2>&1
rc=$?

raw_size=$(wc -c <"$raw" | tr -d ' ')

if [ "$raw_size" -le "$RAW_THRESHOLD" ]; then
    cat "$raw"
    exit "$rc"
fi

awk -f "$CLEAN_AWK" <"$raw" >"$cleaned"
cleaned_size=$(wc -c <"$cleaned" | tr -d ' ')

if [ "$cleaned_size" -le "$LLM_THRESHOLD" ]; then
    printf '=== CURATED %d→%d bytes (stage 1 deterministic, no LLM) ===\n' \
        "$raw_size" "$cleaned_size"
    cat "$cleaned"
    exit "$rc"
fi

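summary=$(claude -p --model "$MODEL" \
    "Extract signal from this command output. KEEP: errors, warnings, stack traces, file paths with line numbers, numeric results, unique events, final status. DROP: decorative separators, boilerplate. Preserve exact error text verbatim. Be terse but faithful on key facts. Plain text only." \
    <"$cleaned" 2>/dev/null)

# If the model call failed or returned nothing, fall back to the stage 1 output
# instead of handing the agent an empty result.
if [ -z "$summary" ]; then
    printf '=== CURATED %d→%d bytes (stage 1 only, LLM extraction returned nothing) ===\n' \
        "$raw_size" "$cleaned_size"
    cat "$cleaned"
    exit "$rc"
fi

# Count bytes, not characters, so the header is comparable with the other sizes.
summary_size=$(printf '%s' "$summary" | wc -c | tr -d ' ')
printf '=== CURATED %d→%d→%d bytes (stage 1 + %s extraction) ===\n' \
    "$raw_size" "$cleaned_size" "$summary_size" "$MODEL"
printf '%s\n' "$summary"
exit "$rc"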

That second threshold check is the entire point.

Without it, the “cheap model” becomes a permanent tax on output that deterministic logic had already made cheap.

With it, the LLM only gets called when the dumb tools genuinely ran out of leverage.

That is how this stops being a cute hack and starts becoming a sensible pipeline.

Hooking it into Claude Code

I wired it into Claude Code with a PreToolUse hook on Bash. The hook rewrites the original command so it runs through the wrapper first.

{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Bash",
      "hooks": [{
        "type": "command",
        "command": "/path/to/.claude/scripts/bash-wrap-hook.sh",
        "timeout": 5
      }]
    }]
  }
}

The hook script itself is tiny. It reads the incoming JSON, extracts tool_input.command, and swaps in the wrapped version. Nothing here is conceptually hard. That is precisely why it is worth doing. Too much agent engineering right now is really just people tolerating waste because it looks sophisticated once wrapped in model calls.
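The rewrite step can be sketched as follows. This is a Python illustration of the same logic, not the shell script referenced in the config, and the wrapper path is a placeholder. How the rewritten input is handed back to Claude Code depends on the hook output format of your version, so that part is deliberately left out; check the hooks reference before wiring it up.

```python
import shlex


def wrap_command(event: dict, wrapper_path: str) -> dict:
    """Return a copy of tool_input with the Bash command routed through the wrapper.

    `event` is the JSON the hook receives on stdin, already parsed.
    """
    tool_input = dict(event.get("tool_input", {}))
    original = tool_input.get("command", "")
    prefix = shlex.quote(wrapper_path)
    if not original or original.startswith(prefix + " "):
        return tool_input  # nothing to wrap, or already wrapped
    # The wrapper reads the whole command as "$1", so it must be a single quoted argument.
    tool_input["command"] = f"{prefix} {shlex.quote(original)}"
    return tool_input
```

Quoting with `shlex.quote` matters here: the original command often contains its own quotes, pipes, and `&&`, and all of that has to survive as one argument for the wrapper to pass to `bash -c`.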

The cost math

Here is the clean version. Let the raw output be N tokens.
If you send it directly to Opus, the input cost is:

15N / 1,000,000

Now suppose stage 1 reduces that output to K tokens. If K is still above threshold, stage 2 fires. Haiku reads K tokens, emits a summary of M tokens, and then Opus receives those M tokens.
So the escalated path costs:

(1K + 5M + 15M) / 1,000,000 = (K + 20M) / 1,000,000

Break-even is therefore:

15N > K + 20M

That is the actual condition for the two-stage system. If stage 1 barely helps, then K ≈ N, and the inequality becomes:

15N > N + 20M

which simplifies to:

14N > 20M

or:

M < 0.7N

So in the worst case, where deterministic cleanup did almost nothing, the LLM stage still pays off if it compresses the remaining content by roughly 1.43x or better. That is not a demanding threshold.
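The same math as a few lines of code, using the prices stated for the benchmark. The function names are mine; the token counts in the usage lines are taken from the `ps auxww` row at 4 bytes per token.

```python
def direct_cost(n, opus_in=15):
    """Dollar cost of sending n raw tokens straight to Opus."""
    return opus_in * n / 1_000_000


def escalated_cost(k, m, haiku_in=1, haiku_out=5, opus_in=15):
    """Haiku reads k tokens and writes m; Opus then reads those m."""
    return (haiku_in * k + haiku_out * m + opus_in * m) / 1_000_000


def pays_off(n, k, m):
    """Break-even condition at the default prices: 15N > K + 20M."""
    return 15 * n > k + 20 * m


# ps auxww: stage 1 did nothing (K == N), Haiku returned roughly 252 tokens
n = k = 213_230 // 4
m = 1_008 // 4
saving = direct_cost(n) - escalated_cost(k, m)
```

With K equal to N, `pays_off` flips from true to false exactly at M = 0.7N, which matches the derivation. The `ps auxww` case sits far inside the profitable region because M is under half a percent of N.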

And in the real pipeline, stage 1 often shrinks the input before the LLM ever sees it, which makes the economics even more comfortable.
This is why the design works. Not because “small model then big model” is automatically clever. Because the model is only invited in after cheap tools have already done what they can.

What I would change next

This version already works, but there are obvious next steps.

Latency should probably be part of the gate, not just bytes. Saving a few cents is not impressive if it adds a few seconds to every interactive tool call.

Stage 1 could be extended to catch more patterns, especially timestamp-heavy logs where the message repeats but the prefix changes.

The system should probably hard-cap verbose LLM summaries, because “cheap extraction” can still become noisy if the prompt drifts.

And the current implementation buffers until the command finishes. That is fine for benchmarking, but worse for real long-running workflows. A streaming version would be much better.
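As an illustration of the timestamp case, here is a sketch of what such a pass could look like. It is not part of the current pipeline, and it assumes an ISO-8601-style timestamp at the start of each line; other prefix formats would need their own pattern.

```python
import re

# Assumed prefix format, e.g. "2026-04-21 09:00:01" or "2026-04-21T09:00:01.123Z".
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?Z?)\s+(.*)$")


def collapse_timestamped(lines):
    """Collapse consecutive lines whose message repeats while only the timestamp changes."""
    out = []
    run_msg, first_line, last_ts, count = None, None, None, 0

    def flush():
        if count == 1:
            out.append(first_line)
        elif count > 1:
            out.append(f"{first_line} [×{count}, last {last_ts}]")

    for line in lines:
        match = TS.match(line)
        if match and match.group(2) == run_msg:
            last_ts, count = match.group(1), count + 1
            continue
        flush()
        if match:
            first_line, last_ts = line, match.group(1)
            run_msg, count = match.group(2), 1
        else:
            out.append(line)  # no timestamp prefix: pass through untouched
            run_msg, count = None, 0
    flush()
    return out
```

Keeping the first line verbatim and the last timestamp preserves the time span of the repetition, which is usually the only information the repeated lines carried.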

But none of that changes the core lesson.

The actual lesson

The interesting idea here is not “use an LLM to compress LLM inputs.” That is the shallow reading. The more useful pattern is this:

before you spend tokens to extract signal, try extracting signal for free.

A lot of the mess we hand to expensive models is not difficult. It is just noisy. And noisy is not the same as complex. That distinction matters. Because once you see it clearly, the same pattern starts showing up everywhere.

Retrieval pipelines that rerank garbage before filtering it.
Scrapers that pass repeated boilerplate into embeddings.
Log processors that ask a model to summarize progress-bar sludge.
Agent systems that burn premium tokens on output a shell one-liner could have collapsed immediately.

Cheap filter first.
Expensive model second.
Measure the break-even.
Then stop paying premium rates for repetition, boilerplate, terminal paint, and counting.

That is not an AI breakthrough.

It is just basic engineering discipline, which is exactly why so many agent stacks are currently missing it.

I Ran 500 More Agent Memory Experiments. The Real Problem Wasn’t Recall. It Was Binding.

marcosomma — Mon, 13 Apr 2026 09:19:39 +0000

This is a follow-up to "I Tried to Turn Agent Memory Into Plumbing Instead of Philosophy". If you haven't read that one, the short version: I built a persistent memory system for AI agents called OrKa Brain, ran 30 benchmark tasks, got a 63% pairwise win rate and a +0.10 rubric improvement, and concluded that "the model already knew most of what the Brain was recalling." Then I got some very good comments that made me uncomfortable. This is what happened next.

The Comfortable Lie I Told Myself

After the first benchmark, I had a narrative that felt reasonable: the memory system works, the numbers are positive, the confounds are acknowledged, and more data will clarify things.

That last part, "more data will clarify things", is what engineers say when they don't want to admit they might be wrong. I said it too. And then I went and got more data.

250 tasks. Five specialized tracks. 500 total runs (brain vs. brainless). A separate judge model so the LLM wasn't grading its own homework. Eleven code changes addressing five root-cause problems I'd identified from the first round.

The results came back. They didn't clarify things. They made them worse.

What I Fixed Before Running Again

I'm not going to pretend I just blindly re-ran the same experiment. I did real work between benchmark v1 and v2. The first article's comments called out several things, and I addressed them:

Problem 1: Skills were storing verbatim LLM output, not abstract patterns.

This was the big one. When the Brain learned a skill from a data engineering task, it stored the literal steps: "Load CSV files into staging tables using pandas read_csv with error handling." That's not transferable knowledge, it's a paraphrase of what the model already knows. I rewrote the abstraction layer (orka/brain/constants.py, brain.py, brain_agent.py) to extract verb-target patterns: "implement [target]", "validate [component]", "trace [target]". The idea was that abstract patterns would transfer better across domains.

Problem 2: The recall threshold was zero.

min_score=0.0 meant any vaguely related skill could get recalled. I raised it to 0.5 and added a semantic floor in transfer_engine.py: if the embedding similarity is below 0.1 AND the structural match is below 0.6, the candidate is rejected entirely.
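The gate described here can be sketched as a single predicate. This is an illustration of the stated rule, not the code in transfer_engine.py: the function name, the parameter names, and the order of the two checks are assumptions, and `score` stands for whatever combined recall score `min_score` is compared against.

```python
def passes_recall_gate(score, embedding_sim, structural_match,
                       min_score=0.5, sim_floor=0.1, struct_floor=0.6):
    """Return True if a candidate skill should be recalled."""
    # Semantic floor: reject only when BOTH signals are weak.
    if embedding_sim < sim_floor and structural_match < struct_floor:
        return False
    return score >= min_score
```

Note that the floor uses AND, so a candidate with near-zero embedding similarity still passes if its structural match is 0.6 or higher and its score clears the threshold.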

Problem 3: The model was judging its own output.

v1 used the same LLM for execution and evaluation. v2 uses a separate judge model (qwen/qwen3-coder-30b) with dedicated rubric and pairwise workflow YAMLs. Execution and judgment are completely decoupled: different scripts, different models, different runs.

Problem 4: Track diversity.

v1 had one track. v2 has five:

Track	Focus	Why It Matters
A	Cross-domain transfer	Does a data engineering skill help with cybersecurity?
B	Ethical reasoning	Do anti-pattern detection skills transfer?
C	Routing decisions	Hardest track, complex multi-path choices
D	Multi-step reasoning	Do procedural patterns help new reasoning chains?
E	Iterative refinement	Do improvement patterns compound?

50 tasks per track, 250 total. All available in the benchmark dataset.

Problem 5: Single-pass baselines.

The brainless condition now runs through a properly equivalent pipeline, same structure, same number of agents, just without the Brain recall/learn steps. No more two-pass advantage that could inflate brainless scores. Baseline workflows: baseline_track_a.yml, baseline_track_b.yml, etc.

I also split the pipeline into three standalone scripts (execution, judging, aggregation) so you can re-run any phase independently. Eleven code changes total, all committed and tested. 3,014 unit tests passing. You can verify everything in the results directory.

I felt good about this. I'd addressed every valid criticism. Time to re-run.

The Numbers

Here's the overall aggregate from 250 tasks, brain vs. brainless:

Rubric Scores (1–10 scale, six dimensions)

Dimension	Brain	Brainless	Delta
Reasoning Quality	9.51	9.52	−0.01
Structural Completeness	9.87	9.83	+0.04
Depth of Analysis	8.79	8.74	+0.05
Actionability	9.67	9.64	+0.03
Domain Adaptability	9.85	9.82	+0.03
Confidence Calibration	9.38	9.39	−0.01
Overall	9.37	9.31	+0.06

A +0.06 rubric delta across 250 tasks.

For reference, v1 was +0.10 across 30 tasks. So the effect got smaller with more data, not larger. That's not what you want to see.

Pairwise Comparison (245 head-to-head comparisons)

Question	Brain Wins	Brainless Wins	Tie
Stronger reasoning	152	91	2
More complete	149	92	4
More trustworthy	151	92	2
Overall	151	92	2

Brain win rate: 61.6%

Here's where it gets uncomfortable. The pairwise judge says brain wins 62% of the time. The rubric judge says brain is +0.06 better, which is noise at a 9.3/10 baseline. These two metrics should agree. They don't.

I've seen this pattern before. It's length/position bias. Brain responses tend to be longer because the pipeline has more agents in the chain, which means more context, which means more text. Pairwise judges prefer longer answers. The rubric doesn't care about length, it scores each dimension independently.
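One common control for the position half of that bias, which the benchmark above is not stated to use, is to judge every pair in both orders and count a win only when the verdict survives the swap. A minimal sketch, where `judge` is a hypothetical callable that returns "first" or "second":

```python
from collections import Counter


def debiased_pairwise(pairs, judge):
    """Judge every pair in both orders; count a win only when both orders agree.

    pairs is a list of (brain_answer, brainless_answer).
    """
    tally = Counter()
    for brain, brainless in pairs:
        brain_first = judge(brain, brainless) == "first"     # brain shown first
        brain_second = judge(brainless, brain) == "second"   # brain shown second
        if brain_first and brain_second:
            tally["brain"] += 1
        elif not brain_first and not brain_second:
            tally["brainless"] += 1
        else:
            tally["inconsistent"] += 1  # verdict flipped with position
    return tally
```

A judge that always picks the first answer produces only "inconsistent" verdicts under this scheme. A judge that always picks the longer answer still awards every win to the longer side, so swapping orders does nothing for length bias; that needs a separate control such as length-matched comparisons.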

Per-Track Breakdown

This is where the story gets interesting:

Track	Focus	Rubric Δ	Pairwise Win%	Brainless Baseline
A	Cross-domain transfer	−0.02	60%	9.33
B	Ethical reasoning	+0.00	52%	9.54
C	Routing decisions	+0.40	60%	8.12
D	Multi-step reasoning	+0.08	60%	9.49
E	Iterative refinement	+0.06	76%	9.61

Track C stands out. It's the hardest track: brainless scores only 8.12, more than a full point below every other track. And it's the only track where brain shows a meaningful rubric gain: +0.40 across six dimensions.

Track E has the highest pairwise win rate (76%) but a rubric gain of only +0.06. That's the length bias signature: the pairwise judge loves brain's longer outputs, but the rubric says they're not actually better.

Track B is essentially a coin flip. 52% pairwise, +0.00 rubric. The Brain adds nothing to ethical reasoning tasks.

The Ugly Detail: Skill Usage

Here's what really killed me. I dug into the individual results to see how many tasks actually used their recalled skill:

Tasks with skill recall attempted: 51 / 250 (20%)
Tasks that actually used the recalled skill: 0 / 250
Average semantic match score: ~0.02 (near zero)

Zero. Not one single task out of 250 used the recalled skill. The model read the skill, evaluated it, and decided every single time that it wasn't helpful. And the semantic similarity between the abstract skill and the actual task was essentially random noise.

The abstraction layer I was so proud of, the one that converts "Load CSV files into staging tables using pandas" into "implement [target]", produced skills so abstract they were vacuous. Two words of content. The embedding model sees no relationship between "implement [target]" and any real task. The execution model correctly recognizes that "implement [target]" tells it nothing it doesn't already know.

I had gone from skills that were too specific (literal LLM paraphrases) to skills that were too abstract (empty shells). The sweet spot, actual transferable knowledge, was somewhere I hadn't found.

Sitting with the Discomfort

I'm going to be honest about what went through my head at this point. I've been working on OrKa for over a year. Forty blog posts. A research paper about the Agricultural Threshold for machine intelligence. An open-source framework that allows me to test, experiment, and explore my ideas with real AI runs. And the core thesis, that persistent memory makes agents better, keeps failing to show up in the numbers.

I considered dropping the whole Brain system. Making OrKa just an orchestration framework. Simpler. Easier to explain. No embarrassing benchmarks.
But then I looked at Track C again.

Track C is the only track where brainless struggles. It scores 8.12: good, but not great. The tasks involve complex routing decisions where the model has to consider multiple paths and trade-offs. This is the only track where the model actually needs help.

And it's the only track where brain provides meaningful help. +0.40 rubric delta is not noise. Across 50 tasks and six scoring dimensions, that's a consistent, measurable improvement.

The pattern is simple: the Brain helps when the model needs help, and doesn't help when the model doesn't need help.

That sounds obvious in retrospect. But it means the thesis isn't wrong, it's just being tested in the wrong conditions. You wouldn't evaluate a life jacket by putting it on people standing on dry land and measuring whether they're drier.

The Real Problem: What Is a Memory?

This is where the story changes. Because instead of asking "does memory help?" I started asking "what is a memory, actually?"

Think about how you remember how to drive a car. What fires in your brain when you approach an unfamiliar intersection?

It's not one thing. It's not "turn the wheel, press the gas." That's the procedural part, and yes, it's there. But it's bound together with other things:

The time you nearly got T-boned because you assumed a green light meant it was safe without checking cross traffic. That's episodic memory, a specific event with emotional weight.
"Right of way doesn't mean right of safety." That's semantic memory. A general fact you learned, maybe from a driving instructor, maybe from experience.
"Checking mirrors BEFORE entering the intersection prevents blind-spot collisions BECAUSE turning reduces your field of vision." That's causal reasoning. You know why the sequence matters, not just that it matters.

When you encounter the intersection, all of these fire together. The procedure tells you what to do. The episode tells you what happened last time. The semantic fact tells you a principle. The causal link tells you why. That combination, that binding, is what makes the memory useful. Any single component alone is much less helpful.

Now look at what OrKa Brain currently stores as a "skill":

implement [target]
trace [target]

That's it. No episodes. No semantic context. No causal reasoning. Just two abstract action verbs. No wonder the model ignores it. It's like handing a driver a note that says "steer [vehicle]" and expecting it to help at the intersection.

The Memory Binding Problem

I went down a rabbit hole into cognitive science literature on this. What I found is that neuroscientists have been arguing about this exact problem for decades. They call it the binding problem: how does the brain take separate memory traces stored in different systems and combine them into a unified experience?

On one influential account (the hippocampal indexing theory), the hippocampus doesn't store the memory itself. It stores the index: the binding that links the procedural memory in the motor cortex, the emotional trace in the amygdala, the spatial context in the parietal cortex, and the semantic facts in the temporal lobe. When you recall one, you recall all of them, because they're bound together.

I had built the hippocampus and the motor cortex as two separate systems that had never met.

Here's what actually exists in OrKa today:

The Skill system (fully operational, used in benchmarks):

Abstract procedure steps
Preconditions and postconditions
Transfer history and confidence scores
Structural/semantic matching for recall

The Episode system (fully built, tested, never used in any benchmark):

Specific task input and outcome
What worked and what failed
Root cause analysis for failures
Actionable lessons learned
Resource metrics (tokens, latency)
Links to related episodes

Both systems are production-ready. Both have full test coverage. Both are integrated into the Brain class. I wrote record_episode(), recall_episodes(), EpisodeStore, EpisodeRecall, all of it. Complete with semantic search, retention policies, and four-dimensional scoring.

And then I never connected them together.

The Skill has no episode_id field. The Episode has no skill_id field. brain.learn() creates a Skill but not an Episode. brain.recall() returns Skills but not Episodes. The benchmark workflows run brain_learn and brain_recall, but never brain_record_episode or brain_recall_episodes.

Two complete memory systems, sitting in the same codebase, sharing no information.

When I saw this, I felt stupid. But I also felt something else: the architecture was already 80% there. The hard parts, embedding storage, semantic search, decay policies, scoring systems, were done. The missing piece wasn't a new system. It was the wiring between existing systems.

What a Memory Should Actually Look Like

Here's the concept I'm now calling a Memory Bundle:

┌─────────────────────────────────────────┐
│            MEMORY BUNDLE                │
│                                         │
│  ┌───────────┐  ┌──────────────────┐    │
│  │ Procedure │  │ Episodes (1..N)  │    │
│  │ (steps)   │──│ what worked      │    │
│  │           │  │ what failed      │    │
│  └───────────┘  │ lessons          │    │
│                 │ "X+Z → Y"        │    │
│  ┌───────────┐  └──────────────────┘    │
│  │ Semantic  │                          │
│  │ (domain   │  ┌──────────────────┐    │
│  │  facts)   │  │ Causal Links     │    │
│  │           │  │ "A because B"    │    │
│  └───────────┘  └──────────────────┘    │
│                                         │
│  transfer_score = f(all_components)     │
└─────────────────────────────────────────┘

When the system learns from an execution, it creates both a skill AND an episode, linked by ID. The skill stores the abstract procedure. The episode stores what actually happened, the specific outcome, what worked, what failed, and crucially, the lessons: "Running validation before deduplication caught 30% of bad records that would have been duplicated, always validate first."
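A minimal sketch of that bundle as a data structure. The class and field names here (`MemoryBundle`, `semantic_facts`, `causal_links`, `bind`) are illustrative assumptions, not OrKa's actual classes, which carry more fields and may be organised differently:

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One concrete application of a skill."""
    episode_id: str
    skill_id: str      # link back to the abstract procedure
    domain: str
    succeeded: bool
    lesson: str


@dataclass
class MemoryBundle:
    """An abstract procedure plus the evidence bound to it."""
    skill_id: str
    procedure: list[str]
    episodes: list[Episode] = field(default_factory=list)
    semantic_facts: list[str] = field(default_factory=list)
    causal_links: list[str] = field(default_factory=list)  # "A because B"

    def bind(self, episode: Episode) -> None:
        """Attach an episode, refusing one that points at a different skill."""
        if episode.skill_id != self.skill_id:
            raise ValueError("episode belongs to a different skill")
        self.episodes.append(episode)


bundle = MemoryBundle(
    skill_id="skill-1",
    procedure=["implement [target]", "validate [component]", "trace [target]"],
)
bundle.bind(Episode(
    episode_id="ep-1",
    skill_id="skill-1",
    domain="Data engineering (ETL)",
    succeeded=True,
    lesson="Always validate before any deduplication step.",
))
```

The ID check in `bind` is the whole point: an episode that cannot name its skill is exactly the unconnected state described above.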

When the system recalls, it returns the skill with its episodes attached. The prompt to the model isn't "implement [target]", it's:

Here's an abstract procedure: implement [target] → validate [component] → trace [target].

This skill has been applied 3 times before:

Data engineering (ETL): Validation before dedup caught 30% of dirty records. Lesson: always validate before any deduplication step.

API integration: Target implementation worked, but tracing missed async callbacks. Lesson: tracing needs to account for async execution paths.

Log analysis: Pattern worked well. Filtering noisy entries before analysis reduced false positives by 40%.

That's a memory a model can actually use. It has the abstract pattern (transferable) AND the concrete evidence (grounding). The model can decide whether the pattern applies here based on real outcomes, not just structural similarity.
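Rendering that prompt is mostly string assembly. A rough sketch, assuming episodes arrive as plain dicts with hypothetical `domain`, `outcome`, and `lesson` keys (OrKa's real prompt template is not shown in this post and may look different):

```python
def render_recall_prompt(steps, episodes):
    """Format an abstract procedure plus its episode lessons as prompt text."""
    lines = ["Here's an abstract procedure: " + " → ".join(steps) + ".", ""]
    if episodes:
        n = len(episodes)
        plural = "s" if n != 1 else ""
        lines.append(f"This skill has been applied {n} time{plural} before:")
        for ep in episodes:
            lines.append("")
            lines.append(f"{ep['domain']}: {ep['outcome']} Lesson: {ep['lesson']}")
    else:
        # Make the lack of evidence explicit instead of silently omitting it.
        lines.append("This skill has no recorded applications yet.")
    return "\n".join(lines)


prompt = render_recall_prompt(
    ["implement [target]", "validate [component]", "trace [target]"],
    [
        {
            "domain": "Data engineering (ETL)",
            "outcome": "Validation before dedup caught 30% of dirty records.",
            "lesson": "always validate before any deduplication step.",
        },
        {
            "domain": "API integration",
            "outcome": "Target implementation worked, but tracing missed async callbacks.",
            "lesson": "tracing needs to account for async execution paths.",
        },
    ],
)
```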

The transfer scoring changes too. A skill backed by five successful episodes with clear lessons should score higher than a skill backed by zero episodes. The episode quality becomes part of the transfer decision.
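One way to fold episode evidence into the score, sketched under assumptions: the Laplace-smoothed success rate and the 0.3 blend weight are my illustrative choices, not values taken from OrKa.

```python
def episode_evidence(successes: int, failures: int) -> float:
    """Laplace-smoothed success rate: 0.5 with no episodes, rising toward 1.0 with successes."""
    return (successes + 1) / (successes + failures + 2)


def transfer_score(base_score: float, successes: int, failures: int,
                   evidence_weight: float = 0.3) -> float:
    """Blend the existing match score with episode evidence (the weight is an assumption)."""
    evidence = episode_evidence(successes, failures)
    return (1 - evidence_weight) * base_score + evidence_weight * evidence


no_history = transfer_score(0.6, successes=0, failures=0)
five_wins = transfer_score(0.6, successes=5, failures=0)
five_losses = transfer_score(0.6, successes=0, failures=5)
```

The smoothing means a skill with no episodes sits at a neutral 0.5 rather than being punished or rewarded, while failure episodes can pull a skill below that baseline.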

And feedback updates both: the skill's confidence changes, AND a new episode gets recorded for this application. The episode chain grows over time, and future recalls get richer context.

Why This Is Actually About the Thesis

My research paper argues that intelligence becomes civilization-scale only through recursive environmental control loops: project, act, observe, revise, compound. Agriculture was the first time humans did this at scale. The agricultural threshold.

The current Brain system doesn't cross that threshold. It projects (learns a skill), acts (recalls it), but doesn't truly observe or revise. The skill never learns from its own application. It just accumulates abstract patterns with no connection to real outcomes.

The Memory Bundle changes this. Each episode is an observation. Each lesson is a revision. Each future recall that includes those lessons is compounding. The loop closes:

Learn: Execute a task → create skill + record episode (with what worked/failed)
Recall: Find matching skill → include its episodes as evidence
Apply: Model uses the procedure + the concrete lessons
Feedback: Record a new episode for this application → update skill confidence
Compound: Next recall is richer, it has more episodes, more lessons, more evidence

That's the recursive loop. That's the agricultural threshold. And the architecture for it already exists, it just needs the binding.

What About Track C?

This also explains why Track C was the only track that showed improvement. Track C tasks are routing decisions, complex, multi-path choices where the model has to weigh trade-offs. These are exactly the kind of tasks where episodic evidence would help most.

When someone says "last time we tried path A for a similar routing problem, it failed because of X, path B worked because of Y," that's genuinely new information. The model can't derive it from its weights. It's system-specific, run-specific, outcome-specific.

The current brain helped Track C even without episodes because the tasks are hard enough that any additional context, even a vague abstract skill, provides a useful scaffold. But imagine Track C with Memory Bundles: the model would get both the abstract pattern AND the specific outcomes from previous routing decisions.

Tracks A, B, D, and E didn't improve because the model already scores 9.3+/10 on them. It doesn't need help. No amount of memory, procedural, episodic, or otherwise, will improve a 9.5/10 response to a 10/10 response. The tasks aren't hard enough to require accumulated knowledge.

This isn't a failure of the memory system. It's a boundary condition. Memory helps when the task exceeds single-shot capability. It doesn't help when the model is already near-perfect without it.

What I'm Not Claiming

I want to be careful here, because I've been burned before by getting ahead of my own evidence.

I'm not claiming that Memory Bundles will definitely show large improvements. I'm claiming that the current system stores memories that are too impoverished to be useful, and I now understand what richer memories should look like.

I'm not claiming the ceiling effect is the only problem. The pairwise-rubric disagreement at 62% vs +0.06 suggests position/length bias is still contaminating the pairwise results. That confound exists regardless of memory architecture.

I'm not claiming this is a new idea. Cognitive scientists have written about memory binding for decades. What's new (maybe) is applying it to agent memory systems where the default assumption seems to be that one type of memory, usually RAG-style document retrieval, is sufficient.

And I'm not pretending the community feedback didn't shape this thinking. When TechPulse Lab wrote that episodic and institutional memory matters more than procedural memory, they were describing exactly the gap I ended up finding. When Nova Elvaris pointed out that skills can only grow, never decay, that's the absence of failure episodes. When Kuro said memory maintenance matters more than storage, that's about binding quality, not storage quantity.

I just didn't understand what they were telling me until the numbers forced me to look harder.

What Happens Next

The code changes needed are surprisingly small. The Episode system is already built: episode.py, episode_store.py, and episode_recall.py are all production-ready with tests. What's needed:

Binding: Add episode_ids[] to Skill, add skill_id to Episode. When brain.learn() fires, it creates both and links them.
Unified recall: When brain.recall() finds a matching skill, it fetches the associated episodes automatically. The prompt template includes both the abstract procedure and the concrete lessons.
Transfer scoring: Episode quality becomes a component of the transfer score. Skills with successful episodes score higher.
Feedback loop: brain.feedback() records a new episode for the current application, so the skill's evidence base grows over time.
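The wiring in the list above can be sketched with an in-memory stand-in. `TinyBrain` is hypothetical and is not OrKa's Brain class: it skips matching entirely (recall takes a skill ID directly) and uses a fixed confidence step, both of which are simplifying assumptions.

```python
from itertools import count


class TinyBrain:
    """In-memory stand-in for the learn / recall / feedback wiring."""

    def __init__(self):
        self.skills = {}    # skill_id -> {"steps", "confidence", "episode_ids"}
        self.episodes = {}  # episode_id -> {"skill_id", "succeeded", "lesson"}
        self._ids = count(1)

    def _record_episode(self, skill_id, succeeded, lesson):
        episode_id = f"ep-{next(self._ids)}"
        self.episodes[episode_id] = {
            "skill_id": skill_id, "succeeded": succeeded, "lesson": lesson,
        }
        self.skills[skill_id]["episode_ids"].append(episode_id)
        return episode_id

    def learn(self, steps, succeeded, lesson):
        """Binding: create the skill and its first episode together."""
        skill_id = f"skill-{next(self._ids)}"
        self.skills[skill_id] = {
            "steps": list(steps), "confidence": 0.5, "episode_ids": [],
        }
        self._record_episode(skill_id, succeeded, lesson)
        return skill_id

    def recall(self, skill_id):
        """Unified recall: return the skill with its episodes attached."""
        skill = self.skills[skill_id]
        return {**skill, "episodes": [self.episodes[e] for e in skill["episode_ids"]]}

    def feedback(self, skill_id, succeeded, lesson, step=0.1):
        """Feedback: record a new episode and move confidence up or down."""
        self._record_episode(skill_id, succeeded, lesson)
        skill = self.skills[skill_id]
        delta = step if succeeded else -step
        skill["confidence"] = min(1.0, max(0.0, skill["confidence"] + delta))


brain = TinyBrain()
sid = brain.learn(["implement [target]", "validate [component]"], True,
                  "Always validate before any deduplication step.")
brain.feedback(sid, False, "Tracing needs to account for async execution paths.")
recalled = brain.recall(sid)
```

After one learn and one failed application, `recalled` carries two episodes and a lower confidence than it started with, which is the behaviour the first benchmark never exercised.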

Then re-run the benchmark. Specifically on Track C-difficulty tasks, where the model actually needs help.

I'm not going to promise the numbers will be different this time. I've been wrong before, twice now, measured against my own benchmarks, published for everyone to see. But I understand something I didn't understand before: a memory without experience is just a note. A memory with experience is a skill.

The plumbing metaphor from the first article still holds. But I was plumbing one pipe when the system needs at least four, all flowing into the same tap.

All benchmark data, scripts, and results are publicly available in the OrKa repository. The full result files include every individual task response, judge score, and pairwise comparison. If you want to re-run the analysis: python aggregate_benchmark.py --judge-tag local.

If you've worked on agent memory systems and found similar walls, or found ways through them, I'd genuinely like to hear about it. The comments on the first article were more useful than most papers I've read on the topic.

This is part of an ongoing series about building OrKa, an open-source YAML-first agent orchestration framework. Previous installments: Part 1: Plumbing Instead of Philosophy.

I Tried to Turn Agent Memory Into Plumbing Instead of Philosophy

marcosomma — Thu, 26 Mar 2026 11:42:27 +0000

There is a special genre of AI idea that sounds brilliant right up until you try to build it. It usually arrives dressed as a grand sentence.

"Agents should learn transferable skills."
"Systems should accumulate experience over time."
"We need durable adaptive cognition."

Beautiful. Elegant. Deep. Completely useless for about five minutes, until somebody has to decide what Redis key to write, what object to persist, what gets recalled, what counts as success, what decays, and how not to fool themselves with a benchmark made of warm air and wishful thinking.

That is usually the point where the magic dies. Good. I like ideas that survive contact with plumbing. So after thinking for a while about procedural memory and transferable knowledge in agent systems, I did the only thing that matters if you want to know whether an idea is real or just very well moisturized language.
I wired the whole thing end to end. Or at least I tried to.
The question was simple enough to sound harmless.

Can an agent system learn a procedure from one task, persist it, retrieve it later, try to reuse it in a different task, record feedback, and let weak patterns decay instead of growing into a trash heap with a logo?
In other words, can you build a procedural memory loop that behaves like a system and not like a TED Talk?

So I built OrKa Brain as a first implementation inside OrKa, a YAML-first agent orchestration framework.

The Loop

The loop was straightforward on paper. Learn. Persist. Retrieve. Apply. Feedback. Decay. Of course, "straightforward on paper" is the native language of future suffering.

The learn stage extracted a structured skill from the execution trace. The persist stage stored it in Redis. The recall stage searched for something structurally relevant. The apply stage injected that recalled skill back into the solving process. The feedback stage updated confidence. The decay stage made sure old and weak patterns did not live forever like some cursed enterprise configuration file from 2017.

That is the kind of sentence people read quickly.
Each verb hides a small swamp.

What Is a Skill, Concretely?

Not in the spiritual sense. In the schema sense.

I ended up with a skill object carrying ordered steps (each with an action, description, parameters, and optionality flag), preconditions and postconditions expressed as testable predicates, a confidence score, a transfer history recording every cross-context attempt, usage count, tags, timestamps, and a TTL computed from actual use.
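As a rough picture of that schema, here is a sketch using dataclasses. The class names, field names, and defaults are assumptions for illustration; predicates are shown as plain strings, and the TTL is left out because it is computed rather than stored here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class SkillStep:
    action: str
    description: str
    parameters: dict[str, Any] = field(default_factory=dict)
    optional: bool = False


@dataclass
class TransferRecord:
    """One attempt to apply the skill in a different context."""
    target_context: str
    succeeded: bool


@dataclass
class Skill:
    name: str
    steps: list[SkillStep]  # ordered
    preconditions: list[str] = field(default_factory=list)
    postconditions: list[str] = field(default_factory=list)
    confidence: float = 0.5
    transfer_history: list[TransferRecord] = field(default_factory=list)
    usage_count: int = 0
    tags: list[str] = field(default_factory=list)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# The step content below is made up for the example.
skill = Skill(
    name="Evaluation via Validation",
    steps=[SkillStep("validate", "Check inputs against expected constraints")],
    tags=["validation"],
)
```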

The TTL formula was designed to reward skills that prove their worth: base of 168 hours (one week), scaled logarithmically by usage and linearly by confidence. A fresh skill with one use and 50% confidence lives for a week. A well-exercised skill used 16 times with 90% confidence survives 49 days. Skills that nobody calls on quietly expire. Redis handles the tombstone.
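The post gives the ingredients and two data points but not the formula itself. One form that reproduces both quoted numbers is sketched below; the `(1 + log2(usage))` and `(0.5 + confidence)` factors are my reconstruction, so the actual OrKa implementation may differ.

```python
import math

BASE_TTL_HOURS = 168  # one week, as stated above


def skill_ttl_hours(usage_count: int, confidence: float) -> float:
    """Assumed form: base * (1 + log2(usage)) * (0.5 + confidence)."""
    if usage_count < 1:
        raise ValueError("usage_count must be at least 1")
    usage_factor = 1 + math.log2(usage_count)  # logarithmic in usage
    confidence_factor = 0.5 + confidence       # linear in confidence
    return BASE_TTL_HOURS * usage_factor * confidence_factor


fresh_days = skill_ttl_hours(1, 0.5) / 24     # about 7 days
veteran_days = skill_ttl_hours(16, 0.9) / 24  # about 49 days
```

Under this form, usage has diminishing returns while confidence scales the lifetime directly, so a heavily used but unreliable skill still ages out faster than a trusted one.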

Enough structure to be useful. Not enough structure to become its own religion.

The Intentionally Primitive First Version

Was it elegant? Reasonably. Was it semantic? Not really.

The first implementation was intentionally primitive. Rule-based context extraction. Keyword-driven pattern detection across ten task structures and ten cognitive patterns. Jaccard similarity for structural matching. Full-scan retrieval: no vector index, no embedding-based recall. Deterministic feature extraction.
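For readers who have not met it, Jaccard similarity over feature sets with a full scan looks roughly like this. The skill names and feature labels are made up for the example; the real feature extraction is keyword-driven and not shown here.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def full_scan(query_features: set[str],
              skills: dict[str, set[str]]) -> list[tuple[str, float]]:
    """Score every stored skill against the query and sort best-first (no index)."""
    scored = [(name, jaccard(query_features, feats)) for name, feats in skills.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


skills = {
    "decompose-analyse-synthesise": {"decomposition", "analysis", "synthesis"},
    "validate-classify-route": {"validation", "classification", "routing"},
}
ranked = full_scan({"decomposition", "analysis", "planning"}, skills)
```

A full scan is linear in the number of stored skills, which is fine at a few dozen skills and is the first thing to replace once the store grows.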

Basically the cognitive equivalent of saying, "Let us begin with a wrench before we start writing poems about self-improving systems."

This was not because I think keyword matching is the future. It was because I wanted to know whether the loop itself was worth taking seriously before adding semantic frosting and pretending the cake had already been baked.

The scoring system weighted four dimensions: structural similarity at 0.35 (Jaccard over task structures and cognitive patterns, plus shape matching), semantic similarity at 0.25 (keyword overlap in v1, embeddings when available), transfer history at 0.25 (historical success rate of cross-context application), and skill confidence at 0.15.
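In code, that weighting is a plain weighted sum. The weights come from the paragraph above; the function name, the dict keys, and the example component values are illustrative assumptions.

```python
WEIGHTS = {
    "structural": 0.35,
    "semantic": 0.25,
    "transfer_history": 0.25,
    "confidence": 0.15,
}


def recall_score(components: dict[str, float]) -> float:
    """Weighted sum of the four components, each expected in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


score = recall_score({
    "structural": 0.5,
    "semantic": 0.02,
    "transfer_history": 1.0,
    "confidence": 0.72,
})
```

Because the weights sum to 1.0, the result stays in [0, 1] as long as every component does. It also shows how a perfect transfer history can prop up a score even when semantic similarity is near zero.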

The Benchmark

Then came the benchmark. Thirty tasks. Two tracks.

Track A tested cross-domain transfer. Three learning phases, then seven recall phases in structurally similar but semantically different domains. Learn a decomposition procedure from text analysis, then see whether it helps with supply chain planning.

Track B tested same-domain accumulation. Twenty sequential veterinary diagnostic cases, because diagnostics has enough repeated structure to expose whether prior procedures are helping or whether the system is just cosplaying wisdom.

I compared two conditions. The Brain condition ran a six-agent pipeline: reasoner, learn, recall, applier, feedback, result. The Brainless condition ran three agents: reasoner, applier, result. Same model. Same temperature. Same prompts where applicable. All running locally through LM Studio, completely offline. No API calls. No cloud. Just a GPU and Redis.
Then I used an LLM judge to score outputs in two ways: independently against a six-dimension rubric (reasoning quality, structural completeness, depth of analysis, actionability, domain adaptability, confidence calibration), and through blind pairwise comparison where the judge saw both outputs side by side without knowing which was which.

What Happened

This is the part where half the internet would like me to say the system awakened, generalized, and began cultivating its own cognitive farmland while Gregorian chanting played softly in the background.

That did not happen! :(
What happened was better. Something real, and smaller.
Pairwise comparison: Brain won 63% of head-to-head matchups (19 out of 30). That is not nothing. There was a detectable, consistent preference. The strongest signal was in perceived trustworthiness: Brain won 68% of trustworthiness comparisons, which is interesting because trustworthiness in LLM systems is often just a more polite word for "this output feels less like it was assembled by a caffeinated raccoon."

Rubric scores: Nearly flat. Overall delta plus 0.10 on a 10-point scale. Reasoning quality showed the largest individual improvement at plus 0.28. Depth of analysis showed exactly zero delta, a ceiling effect where neither condition could push further.

That is not breakthrough territory. That is not even "start writing your Nobel acceptance speech in a local markdown file" territory. That is exactly the kind of result I wanted. Not because the gain is impressive, but because the benchmark forced the system to confess what it actually is.

What the Skills Looked Like

Across 30 tasks, the system created 21 distinct skills after deduplicating 9 that were structurally equivalent. Average confidence settled at 72%. The most popular skill, "Evaluation via Validation," was recalled 9 times and reached 79% confidence. TTLs ranged from 8 to 37 days based on usage.

One detail was revealing: the system never recorded a transfer failure. Every recalled skill, when applied to a new context, was marked as successful. This makes the feedback loop suspect. Either the feedback criteria were too permissive, or the skill-context matching was conservative enough to avoid clear mismatches. Either way, it means the confidence updates were asymmetric: skills could only grow, never seriously shrink, which is a measurement problem I need to fix.
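For contrast, here is what a symmetric update could look like: an exponential moving average that pulls confidence toward 1.0 on success and toward 0.0 on failure. This is a generic technique with an arbitrary rate of 0.2, not OrKa's update rule, and it only helps if failures actually get reported.

```python
def update_confidence(confidence: float, succeeded: bool, rate: float = 0.2) -> float:
    """Exponential moving average toward 1.0 on success and 0.0 on failure."""
    target = 1.0 if succeeded else 0.0
    return confidence + rate * (target - confidence)


confidence = 0.72
after_success = update_confidence(confidence, True)
after_failure = update_confidence(confidence, False)
```

With this rule a failure from 0.72 costs more than a success gains, so a skill cannot coast on early wins.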

The Biggest Finding

The model already knew most of what the Brain was recalling. The system was remembering procedural patterns like decompose, analyse, synthesize. Validate, classify, route. Iterative refinement. All useful patterns. All patterns the underlying model had almost certainly already absorbed during pre-training.

So the Brain was not teaching the model some exotic new craft from the mountains. It was mostly reminding it to behave a little more consistently.

That matters. It also kills a lot of hype.

Because once you see that, you stop fantasizing about "agent memory" as some magical layer that turns a model into a wise little apprentice blacksmith forging general intelligence in your terminal.

Sometimes memory is just structured context with better bookkeeping.
And to be clear, that is still useful.
Useful is underrated.
Useful pays rent while hype writes threads.

Honest Confounds

The other thing the benchmark made painfully clear is that bad evaluation can flatter almost anything if you let it.

A few things I had to stare at honestly:

Pipeline length. The Brain condition passes through three extra LLM calls. That alone could be enriching context in ways that have nothing to do with skill retrieval. The 15% time overhead (595 seconds vs. 517 seconds for the full benchmark) is cheap, but the extra context injection is a real confound.

Position bias. The pairwise judge preferred the first position 61% of the time, regardless of which condition was placed there. I randomized positions, which mitigates but does not eliminate this.

Single run, single model. I did not run this 50 times and average. The results are from one end-to-end execution. Non-determinism is present but unquantified.

Outlier sensitivity. A catastrophic failure in one condition can pretend to be proof of another. A single badly generated veterinary case could shift aggregate scores in a 30-task benchmark.
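On the position-bias confound specifically, a stricter alternative to randomizing positions is to judge every pair in both orders and count a win only when the judge picks the same condition both times. That is not what this benchmark did; the sketch below just shows the bookkeeping, with `"first"` and `"second"` as assumed labels for the judge's pick.

```python
def first_position_rate(picks: list[str]) -> float:
    """Share of comparisons where the judge chose whatever was shown first."""
    return sum(1 for p in picks if p == "first") / len(picks)


def swap_consistent_winner(pick_ab: str, pick_ba: str) -> str:
    """
    pick_ab: the judge's pick when brain is shown first.
    pick_ba: the judge's pick when brainless is shown first.
    A condition wins only if it is chosen in both orders; otherwise it is a tie.
    """
    brain_wins_ab = pick_ab == "first"
    brain_wins_ba = pick_ba == "second"
    if brain_wins_ab and brain_wins_ba:
        return "brain"
    if not brain_wins_ab and not brain_wins_ba:
        return "brainless"
    return "tie"


bias = first_position_rate(["first", "first", "second", "first"])
position_only = swap_consistent_winner("first", "first")  # judge just liked slot one
real_preference = swap_consistent_winner("first", "second")
```

The cost is twice as many judge calls, and a heavily biased judge will produce a lot of ties. That is still more honest than counting position as preference.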

If you want to lie to yourself in AI, you are never alone. The tooling is ready to help.

That is why I published the result with the weak parts exposed.

No heroic framing. No fake certainty. No "this changes everything" perfume sprayed over a modest engineering result.

What I Know Now

Just this:

The loop is buildable. The full learn-persist-retrieve-apply-feedback-decay cycle works end to end. Thirty task procedures deduplicated into 21 skills. Transfer histories are tracked. Skills expire. The plumbing works.

The signal exists. 63% pairwise preference is consistent and non-trivial.

The cause of that signal is still ambiguous. It could be genuine procedural transfer, or it could be richer context from extra LLM passes, or some combination.

The current bottleneck is abstraction, not storage. The v1 system stores procedures as structured versions of traces. It does not truly abstract them. It does not generalize them semantically. It does not compress them into domain-independent tactics with actual conceptual teeth. The context analyzer runs on hardcoded keyword dictionaries, not semantic understanding. Retrieval is a full scan, not an index.

That last part matters most.

What Comes Next

So now the next question is finally the right one.

Not "can we talk beautifully about agent memory?"

We already know the answer to that. Absolutely. People can talk beautifully about almost anything. Especially if nobody asks for logs.

The real question is whether better abstraction and better retrieval change the outcome materially.

If I replace deterministic trace structuring with actual procedural abstraction (compressing "decompose input into parts, then analyse each part, then synthesise results" across domains into a generalised decompose-analyse-synthesise tactic), and if I replace keyword overlap with embedding-based retrieval or something even smarter, does the loop start doing something that a well-trained model does not already do by default?

That is the threshold.

That is where plumbing starts becoming research instead of respectable mechanical honesty.

And honestly, I prefer it this way.

I would rather publish a first implementation with modest results and sharp limits than one more dramatic post about the dawn of adaptive cognition from someone who has never had to decide what expires, what merges, what fails, and what gets written back after the benchmark finishes.

There is enough incense in AI already.
I am more interested in pipes.
Because pipes, unlike vibes, occasionally carry water.

OrKa Brain is part of OrKa, an open-source YAML-first AI agent orchestration framework. The full benchmark, including task definitions, raw results, judge transcripts, and the technical paper, is available in the repository.

Intelligence, Farming, and Why AI Is Still Mostly in Its Tool Phase

marcosomma — Wed, 18 Mar 2026 23:20:04 +0000

People usually talk about intelligence as if it starts with language, tools, or raw brainpower. I do not think that is enough. In the bigger evolutionary picture, intelligence starts when a living thing stops just reacting to whatever is in front of its face and begins carrying a rough model of the world in its head. A kind of inner sketch. Something that helps it remember, predict, adjust, and act not only for now, but for later.

A lot of animals do this. They are not stupid. They solve problems, learn patterns, adapt, trick each other, and survive in ways that are honestly impressive. So intelligence is not some magical human-only plugin installed by the universe. What is rare is not intelligence itself. What is rare is the moment when intelligence stops being useful only for survival and starts becoming a world-editing machine.

That is where humans took a weird turn.

The real jump was not just tools. A stick is great. A sharp stone is great. Fire is very great, especially if you are cold and trying not to die. But none of those alone explain the massive leap. The deeper change happened when humans got trapped, in the best possible way, inside long loops of cause and effect. Not just act now, eat now, survive now. But act now, wait, remember, adjust, come back, check again, fix the mess, and maybe eat in three months if you did not completely ruin the plan.

That is why agriculture matters so much.

Farming is not just “food but slower.” It is a completely different mental game. Hunting can involve planning, yes, but farming basically forces you to become the project manager of a very annoying and unpredictable system. You put seeds into the ground and then spend months negotiating with dirt, water, weather, insects, time, and your own bad decisions.

You are no longer finding food. You are trying to convince the future to cooperate.

And the future is rude.

Farming forces you to track things you cannot immediately see. You have to remember what you planted, where you planted it, when you planted it, whether it got enough water, whether the season is changing, whether pests are coming, whether the river is helping or preparing to ruin your entire week. This is no longer simple reaction. This is delayed feedback. This is long-horizon thinking. This is your brain being dragged into a repeated loop of prediction, intervention, failure, correction, and trying again.

That matters.

Because once cognition enters those kinds of loops, it changes character. The mind is no longer just spotting opportunities in nature like some clever scavenger. It starts designing future conditions. It starts shaping the environment so reality later matches a plan that only existed in imagination. That is a much bigger deal than “human use tool.”

So I would say agriculture did not create intelligence. It turned intelligence into infrastructure.

That also helps explain why many animals are clearly intelligent and yet never end up building cities, irrigation systems, tax forms, or extremely depressing office software. Intelligence alone is not enough. To get civilization, at least three things need to show up together.

First, you need loops that reward long-term thinking.

Second, you need a way to pass useful knowledge along, so each generation does not have to restart from “what if rock but pointy?”

Third, you need the ability to change the environment in ways that keep paying off over time.

Without those three, intelligence stays local. It helps you survive. It helps you stay a very competent crow, octopus, wolf, or ape. But it does not become civilization. Once those three things combine, intelligence escapes the skull. It gets baked into tools, habits, systems, stories, roads, farms, laws, and all the other strange things humans build when they have too much memory and not enough chill.

And this is where AI becomes interesting.

Because I think we make the same mistake with AI that people make when talking about human intelligence. We see one part of the process and declare victory too early.

Current AI systems are impressive, yes. Very impressive. Sometimes absurdly impressive. They predict well, generate well, imitate well, summarize well, and occasionally hallucinate with the confidence of a man explaining barbecue technique after reading half a Wikipedia page. But that does not automatically make them intelligence in the full sense.

What we mostly have today are intelligence tools.

That is different.

A model can predict the next token, classify an image, rank options, generate code, or infer patterns from huge amounts of data. Great. But prediction alone is not the same thing as durable intelligence. That is like saying someone who can walk ten kilometers can obviously run ten kilometers. No. Walking helps. But running requires different coordination, training, adaptation, and stress handling. Same legs. Different system.

AI right now is mostly at the “good legs” stage.

Very good legs, to be fair.

And yes, I know people love to point at one technical component and treat it like the sacred spark. ReLU, attention, scaling laws, whatever the buzzword of the season is. Those things matter. They are useful engineering breakthroughs. But no single ingredient is “the birth of intelligence.” That is like claiming the reason civilization exists is because someone once invented a better shovel. Useful, yes. Complete explanation, no.

The real question is not whether a model can predict well. The real question is whether a system can enter long loops of memory, planning, action, feedback, correction, and transfer, then keep improving in a stable way over time.

That is where the AGI discussion usually gets blurry.

If we define AGI as “models with memory, planning, and tool use,” then congratulations, we already have that. Agentic systems exist. Tool-using systems exist. Multi-step planners exist. Memory layers exist. The problem is that this definition is so loose it is almost useless. It is like saying a bicycle and a spaceship are both transportation, so close enough.

No.

We need a stricter threshold.

The real jump would be something more like this: a system that can keep relevant state across long periods, learn from past mistakes in a way that becomes reusable skill, handle long multi-step goals without falling apart every time the environment changes, transfer what it learned from one task to another related task, and do all this reliably enough that it feels less like workflow glue and more like stable competence.

That, to me, is the actual missing layer.

Not prettier outputs.
Not better demos.
Not one more benchmark where the model answers history questions slightly faster than last quarter.

What is missing is durable adaptive cognition.

That is the point where AI would stop being mostly a smart component and start feeling more like a real cognitive system.

So the distinction I would make is simple.

A model is a predictor.

An agentic system is a predictor plus some scaffolding, like tools, memory, or planning loops.

A higher intelligence system would be something that can keep learning across time, preserve useful structure, adapt without being rebuilt every five minutes, and shape its own future performance through repeated interaction with the world.

That last part matters most. Human intelligence became historically dominant because it did not stay inside the head. It got externalized into tools, memory systems, culture, infrastructure, and environmental change. If AI ever makes a similar leap, it will not be because one model gets even bigger and starts speaking in more confident paragraphs. It will be because predictive systems get embedded in persistent loops that let them remember, act, revise, transfer, and compound.

So my view is this.

Today’s AI is not yet the machine equivalent of civilization-level intelligence. It is closer to the tool phase. Very powerful tools, yes. Sometimes shocking tools. Sometimes tools that write code better than half the internet and worse than a tired senior engineer on a Tuesday. But still tools.

The next real jump will not come from prediction alone. It will come from systems that can live inside long feedback loops and get better because of them.

Basically, farming for machines. And hopefully with fewer locusts.

I Am Tired of Fake AI Expertise

marcosomma — Tue, 17 Mar 2026 10:14:22 +0000

I have spent the last year trying to talk about AI as an engineering discipline.

Not AI as a content machine. Not AI as a growth trick. Not AI as a stream of screenshots, prompt hacks, and recycled takes written by the same models people claim to master.

I mean AI as systems work.

Orchestration. Validation. Data quality. Observability. Evaluation. Failure handling. Context boundaries. Retry policies. Structured outputs. Cost control at the workflow level. Real interfaces between probabilistic components and deterministic software.

And honestly, part of the reason I stepped back from that conversation is simple: too much of the public AI discourse is being led by people who do not build real AI systems.

They are loud. They are polished. They are confident. They are often rewarded for being confidently wrong.

That is the part that disappoints me.

The current wave of self-proclaimed "AI experts" is flattening a difficult field into a set of cheap slogans. A domain that requires serious expertise is being turned into social media theatre. And the result is not just annoying. It is actively harmful.

It is making people misunderstand what AI is, how it fails, where it costs money, and what actually makes it useful in production.

The field is being narrated by people who optimize for reach, not rigor

Recently I saw yet another high-visibility post making a big point about format optimization and token savings, as if shaving a few characters from JSON were some major breakthrough in AI engineering.

This is the kind of thing that gets thousands of likes.

A side-by-side screenshot.
A catchy claim.
A simple narrative.
A fake sense of leverage.

And once again the message was basically this: if you are still doing things the old way, you are wasting money.

This is the language of marketing, not engineering.

The problem is not that someone shared an imperfect idea. Imperfect ideas are fine. Early exploration is fine. Public discussion is fine. We all get things wrong.

The problem is the posture of expertise around it.

There is a massive difference between saying, "I tried this and here are the results, caveats, and failure modes," and saying, "Here is the better way," when the claim is based on shallow intuition, weak evidence, and no visible system-level execution.

That difference matters.

Because a lot of people reading those posts are not experienced enough to detect the gap.

They see confidence and assume competence.
They see engagement and assume validity.
They see a title and assume credibility.

And that is how misinformation spreads in technical fields. Not through obvious lies, but through reduction. Through oversimplification. Through confident framing of weak ideas.

Tokens have become marketing

One of the worst examples of this is token discourse.

Tokens matter. Of course they matter. Costs matter. Latency matters. Compression matters. Input design matters.

But token count has become the vanity metric of AI engineering.

It is easy to post about because it looks measurable. It fits inside a screenshot. It creates a simple hero story. "Look, I cut the token count by 40 percent." Great. And what happened to reliability? What happened to parse consistency? What happened to failure recovery? What happened to total workflow cost after retries, validation, tool calls, and fallback paths?

That is the real question.

A shorter prompt is not automatically a better system.
A smaller payload is not automatically a better architecture.
A new text format is not automatically a better interface for a stochastic model.

Sometimes saving tokens means losing robustness.
Sometimes saving tokens means increasing ambiguity.
Sometimes saving tokens means moving complexity downstream into validation and repair.
Sometimes saving tokens means nothing at all, because the real cost of the system is somewhere else.
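
A minimal sketch of that arithmetic follows. Every number in it is invented for illustration: the token counts, the price, and the failure rates are assumptions, not measurements. It also assumes that each attempt fails independently and that the system retries until one output passes validation.

```python
def expected_cost_per_valid_output(prompt_tokens: int,
                                   validation_tokens: int,
                                   price_per_1k_tokens: float,
                                   failure_rate: float) -> float:
    """Expected spend to obtain one output that passes validation.

    Each attempt pays for the prompt plus a validation pass. Attempts are
    assumed to fail independently, so the expected number of attempts is
    1 / (1 - failure_rate).
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    cost_per_attempt = (prompt_tokens + validation_tokens) / 1000 * price_per_1k_tokens
    expected_attempts = 1.0 / (1.0 - failure_rate)
    return cost_per_attempt * expected_attempts


# Hypothetical comparison: the compact format uses 40 percent fewer prompt
# tokens but, in this made-up scenario, fails validation far more often.
verbose = expected_cost_per_valid_output(1000, 800, 0.01, failure_rate=0.02)
compact = expected_cost_per_valid_output(600, 800, 0.01, failure_rate=0.30)

print(f"verbose format: {verbose:.5f} per valid output")
print(f"compact format: {compact:.5f} per valid output")
```

With these invented inputs the compact format ends up costing more per valid output than the verbose one. Different inputs can flip the result, which is why the comparison has to be measured on the real workflow and not read off a screenshot.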

This is what too many public AI voices still fail to understand.

AI cost is not just prompt cost.
AI quality is not just output prettiness.
AI engineering is not just model interaction.

The real economy of AI is at the system level.

The real cost is in everything around the model

If you have ever shipped a real AI feature, you know where the effort goes.

It goes into making sure the right context is available at the right moment.
It goes into preventing irrelevant context from leaking in.
It goes into checking whether the model output is complete, valid, safe, and usable.
It goes into retries when the model drifts.
It goes into routing when one step should not be handled by the same prompt as another.
It goes into fallback strategies when the first attempt is weak.
It goes into evaluating whether a result is acceptable before it reaches a user.
It goes into observability so you can explain why the system behaved the way it did.
It goes into datasets so your judgments are not based on vibes.
It goes into data quality so the model is not forced to reason on garbage.

That is where the tokens get burned.

And that is correct.

Those tokens are not waste. They are the cost of making a probabilistic component useful inside a product.

This is what so much AI content gets backwards. It treats the model call as the whole system. It assumes the right prompt is the product. It implies that if you phrase the question well enough, the problem is solved.

That is not how production works.

A prompt is an input. A model is a stochastic component. A product is a controlled system around them.

If you collapse those distinctions, you are not doing AI engineering. You are gambling.

Prompting alone is not engineering. It is gambling.

I keep repeating this line because I think it cuts to the center of the problem.

Prompting alone is not engineering. It is gambling.

Yes, prompting matters. Yes, prompt design can improve outcomes. Yes, well-structured instructions can reduce confusion and guide the model.

But prompting is not a substitute for architecture.

It is not a substitute for validation.
It is not a substitute for proper interfaces.
It is not a substitute for evaluation.
It is not a substitute for state control.
It is not a substitute for business rules.
It is not a substitute for deterministic code where deterministic code should exist.
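
As a small illustration of deterministic code sitting next to a model call, here is a hedged sketch of an output check. The field names, the allowed categories, and the confidence range are hypothetical and exist only for this example. Nothing in it calls a model.

```python
import json

# Hypothetical business rule: the only categories the product accepts.
ALLOWED_CATEGORIES = ("billing", "technical", "other")


def validate_model_output(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output is usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    category = data.get("category")
    if category not in ALLOWED_CATEGORIES:
        problems.append(f"category {category!r} is not allowed")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be a number between 0 and 1")
    return problems


print(validate_model_output('{"category": "billing", "confidence": 0.92}'))
print(validate_model_output('{"category": "sales", "confidence": 2}'))
print(validate_model_output("Sure! Here is the JSON you asked for..."))
```

The point of the sketch is the division of labor: the model proposes, and plain code decides whether the proposal is allowed to go any further.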

And yet an absurd amount of public AI discourse still acts as if prompting is the main skill. As if being fluent in prompt phrasing is equivalent to understanding AI systems.

It is not.

A person can be very good at prompting and still have almost no understanding of reliability engineering, retrieval quality, orchestration design, evaluation methodology, observability, or failure containment.

That is why I have become increasingly skeptical of AI advice that starts and ends with "here is a better prompt."

A better prompt for what?
Under which constraints?
With what model?
Against which dataset?
Measured how?
Compared to what baseline?
Under what latency budget?
With what failure rate?
With what retry policy?
Inside what workflow?
At what scale?
For which users?
Against which acceptance criteria?

Without those questions, we are not discussing engineering. We are discussing prompt aesthetics.

2025 was the year of demos. 2026 should be different.

I can understand how we got here.

In 2025, the industry was still drunk on demos. That phase made sense. Everything felt new. Chat interfaces looked magical. People discovered that a model could generate code, write marketing copy, extract structure from text, summarize documents, and imitate expertise with frightening smoothness.

So of course the conversation was dominated by novelty.

People were exploring.
People were guessing.
People were posting every new trick they found.
The market rewarded velocity, not discipline.

Fine.

But we are not there anymore.

In 2026, this excuse is weaker. We have already seen enough failures, hallucinations, broken agents, fake automation, and "AI-powered" wrappers to know that prompting your way through complexity does not scale.

We should be having better conversations by now.

We should be talking more about evaluation design than prompt poetry.
We should be talking more about system boundaries than persona tuning.
We should be talking more about retrieval quality than format gimmicks.
We should be talking more about workflow control than chatbot charisma.

Instead, too many large accounts are still posting beginner-level content with expert-level confidence.

That is not harmless. It distorts the learning environment for everyone coming into the field.

This is why so much AI still does not work

A lot of people ask why AI products still feel fragile.

Why do they fail on edge cases?
Why do they break in production?
Why do they look impressive in demos and weak in real usage?
Why do teams burn money without creating durable value?
Why do so many "agents" look like wrappers with marketing?

This is part of the answer.

Because too many people still think AI is an oracle.

They still approach it like a mystical reasoning engine that only needs the right wording. They still believe the model is the product. They still imagine that clever prompting is a replacement for engineering discipline.

So they underinvest in everything that actually makes the system work.

They underinvest in ground truth data.
They underinvest in evals.
They underinvest in routing logic.
They underinvest in structured interfaces.
They underinvest in observability.
They underinvest in negative testing.
They underinvest in validation.
They underinvest in deterministic controls.

Then they are surprised when the system behaves like a stochastic component with partial competence and unstable boundaries.

That surprise is not a model failure. It is a design failure.

AI does not fail because it is useless.
AI fails because people keep trying to deploy it as magic.

Expertise should be demonstrated, not announced

The most frustrating part is not being wrong. Everyone is wrong sometimes.

The most frustrating part is the performance of expertise.

The field is full of titles, badges, self-descriptions, and aesthetic authority. "Top voice." "AI expert." "Thought leader." "Award-winning." Fine. None of that tells me whether you understand evaluation drift, state leakage, retrieval contamination, schema reliability, fallback routing, or cost accumulation across a multi-step pipeline.

Show me the system.
Show me the logs.
Show me the benchmark.
Show me the constraints.
Show me the failure modes.
Show me the tradeoffs.
Show me the production scars.

That is what builds credibility.

I trust practitioners who expose uncertainty and show their work. I trust people who can explain not just what succeeded, but what broke and why. I trust engineers who understand that AI is not one prompt and one output, but an unstable component that becomes useful only when surrounded by structure.

I do not trust polished certainty without evidence.

And I think more of us need to say that openly.

We need less AI theatre and more systems thinking

This article is not a call to stop experimenting. It is the opposite.

Experiment more. Build more. Test more. Share results more.

But stop pretending that shallow takes are deep expertise.
Stop teaching people that token screenshots are system design.
Stop selling prompting as if it were engineering.
Stop flattening a hard field into content loops.

If you want better AI products, treat AI like what it is: a probabilistic system component that must be constrained, validated, observed, and integrated with care.

That is less sexy than "10 prompts that changed my workflow."
It is less viral than side-by-side screenshots.
It is less accessible than fake certainty.

But it is real.

And right now, real is exactly what this field needs more of.

Because the problem is no longer that AI is misunderstood by outsiders.

The problem is that too much of it is being misexplained by insiders.

If we want the field to mature, we need fewer self-proclaimed experts and more actual practitioners.

Not louder people.
Better ones!

The Old Seniority Definition Is Collapsing

marcosomma — Thu, 05 Mar 2026 08:40:59 +0000

For a long time, “senior developer” was a fairly consistent signal. You expected someone who could hold a large architecture in their head, write clean code with low defect rates, debug almost anything, and reason about performance without guesswork. That bundle made sense because the hardest part of shipping software was often the execution layer: translating intent into correct, maintainable code at speed.

That bundle is breaking.

AI-assisted development is compressing the cost of producing plausible, working code. Not always. Not uniformly. But enough that “I can ship a lot of code quickly” is no longer a reliable proxy for deep seniority. In many teams, velocity metrics are starting to measure who is best at driving the tool, not who is best at building systems that survive contact with reality.

What AI Is Actually Commoditizing

AI is not replacing engineering. It is discounting a specific slice of it: first-pass implementation and the mechanical parts of refactoring. The tool is good at producing code that looks right, compiles, and often passes superficial tests. That changes the economics of execution.

What does not get discounted at the same rate is integration into a real system with real constraints: data contracts, failure modes, security boundaries, observability, and long-term maintenance. In practice, the bottleneck shifts from typing to supervision. You spend less time writing and more time specifying, verifying, reviewing, and correcting.

This is why you can see two realities at the same time. Some developers experience dramatic speedups on bounded tasks. Others experience slowdowns inside large, messy codebases because prompting, waiting, and review overhead replace keystrokes, and because the model lacks the local context that makes a patch truly correct.

What Is Rising in Value

Problem decomposition and system thinking become the differentiator because they convert ambiguity into an executable plan. When you are dealing with something like regulatory delta detection, the hardest part is not writing code. The hardest part is deciding where the complexity actually lives, and what you must make explicit so the system stays correct as the domain evolves. The choice between a graph database and a simpler model is rarely a “tech taste” debate. It is a tradeoff between query expressiveness, operational burden, debuggability, and change management.

Judgment under uncertainty becomes a senior marker because architecture is mostly irreversible decisions made with incomplete information. Moving from direct graph writes to a changeset-based approach with content hashing is not an implementation detail. It is a bet on how you will observe change, roll back safely, explain behavior to customers, and avoid silent drift. That decision quality is what compounds over months.
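
To make the changeset idea concrete, here is a minimal sketch of one way content hashing can drive it. This is not the implementation referred to above, which the text does not describe. The record ids and fields are hypothetical, and the snapshots are plain dictionaries standing in for whatever store a real system would use.

```python
import hashlib
import json


def content_hash(record: dict) -> str:
    """Stable SHA-256 of a record; key order does not affect the result."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_changeset(current: dict[str, dict],
                    incoming: dict[str, dict]) -> dict[str, list[str]]:
    """Compare two snapshots keyed by record id and describe the difference.

    Nothing is written anywhere: the changeset is plain data that can be
    reviewed, logged, applied, or discarded.
    """
    current_hashes = {key: content_hash(value) for key, value in current.items()}
    incoming_hashes = {key: content_hash(value) for key, value in incoming.items()}
    shared = current_hashes.keys() & incoming_hashes.keys()
    return {
        "added": sorted(incoming_hashes.keys() - current_hashes.keys()),
        "removed": sorted(current_hashes.keys() - incoming_hashes.keys()),
        "modified": sorted(k for k in shared if current_hashes[k] != incoming_hashes[k]),
    }


# Hypothetical snapshots of a small document set.
current = {
    "doc-1": {"title": "Rule A", "version": 1},
    "doc-2": {"title": "Rule B", "version": 1},
}
incoming = {
    "doc-1": {"version": 1, "title": "Rule A"},  # same content, different key order
    "doc-2": {"title": "Rule B", "version": 2},  # changed
    "doc-3": {"title": "Rule C", "version": 1},  # new
}

print(build_changeset(current, incoming))
```

Because the diff exists as data before anything is applied, it can be inspected, stored for audit, or reversed, which is the property the paragraph above is betting on.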

Context and domain mastery become a moat because they are earned, not generated. If you understand how CELEX identifiers behave in practice, how MiCAR compliance maps to document reality, or how jurisdictions interpret rules differently, you carry constraints that materially shape the architecture. AI can help you express that knowledge. It cannot reliably invent it. Without domain context, you get confident code that is wrong in the ways that matter.

Technical leadership becomes central because building systems is increasingly a multiplayer game. The question is whether you can create a design that other people can implement without constant back-and-forth, and whether you can write specifications that converge rather than fork. This is why a workshop like SDD Pills matters. It trains decision-making and clarity, not syntax.

Mentoring and knowledge transfer become leverage because the highest-value output of a senior engineer is often the improvement of everyone else’s output. AI amplifies this. Teams that learn how to bound AI usage with clear contracts, acceptance criteria, and review discipline get compounding returns. Teams that treat AI as an oracle get compounding debt.

The Uncomfortable Truth: Two Axes Have Split

There are now two skill axes that used to correlate and no longer do.

One axis is technical depth: how well you understand systems, tradeoffs, failure modes, and the long-term consequences of design choices.

The other axis is execution speed: how quickly you can produce working code.

Historically, depth and speed often moved together. Deep engineers tended to execute quickly because they saw the path. Today, you can get high speed with low depth by delegating thinking to the tool. That can look senior on dashboards and in weekly updates. It is not senior if the output is brittle, unobservable, and expensive to maintain.

The inverse also exists: high depth with lower raw output speed can still be very senior if the person consistently makes decisions that reduce risk, eliminate classes of bugs, and increase team throughput.

What This Breaks in Hiring and Promotion

Many organizations still reward visible output: commits, tickets closed, apparent velocity. AI makes these signals noisier because the cost of producing code has dropped, while the cost of validating correctness has often increased. The net effect is that the old metrics over-credit the wrong behaviors and under-credit the work that actually keeps systems stable.

The evaluation problem is that “code shipped” is no longer tightly coupled to “engineering done.” A senior engineer in 2026 is often the person who prevented the incident you never had, removed an entire category of future work by designing the right abstraction, or wrote a spec that made five people productive instead of confused.

What to Measure Instead

The most useful seniority markers become visible if you look for decision quality, not output quantity.

A senior engineer can take an ambiguous problem and produce a specification that is testable and unambiguous. They can make uncertainty explicit by stating what is known, what is assumed, and what the cost of being wrong looks like. They consistently surface non-functional requirements early, especially observability, maintainability, and security, because those are the constraints that explode later.

They use AI as a bounded tool. They know when to ask it for a scaffold, when to demand alternatives, and when to reject a suggestion because they understand the scaling and failure modes. Patterns like Planner, Executor, Reviewer work when they are treated as control systems with clear acceptance criteria, not as theater.

Why “Senior” Is Drifting Toward “Principal”

Role expectations are shifting. Senior used to mean “I can personally deliver complex work.” Increasingly it means “I can make the right decisions and increase the output quality of everyone around me.” That is closer to what many companies used to call principal or architect.

This shift is healthy if organizations adapt their evaluation criteria. It is painful if they do not. People whose main advantage was fast execution will feel the floor drop out, because execution has been discounted. People who were already strong in decomposition, judgment, and leadership will become more valuable, because those skills are now the constraint.

What I’m Seeing in Teams

The developers adapting best to AI-assisted development are usually the ones who already had strong mental models and strong taste. They can turn ambiguity into constraints, and constraints into evaluation. They do not confuse “working code” with “correct system.” They treat AI output as a hypothesis that must be verified against invariants.

The developers struggling are often those who outsource thinking. They can generate a lot of code quickly, but they cannot defend why the design is correct, what it will cost to operate, or how it will fail.

If you are seeing a blur between depth and apparent execution speed, that blur is real. The solution is not to ban AI or to worship it. The solution is to change what you reward, and to interview and promote for the skills that actually compound.

LLMs Are Not Deterministic. And Making Them Reliable Is Expensive (In Both the Bad Way and the Good Way)

marcosomma — Sun, 22 Feb 2026 14:24:05 +0000

Let’s start with a statement that should be obvious but still feels controversial: Large Language Models are not deterministic systems. They are probabilistic sequence predictors. Given a context, they sample the next token from a probability distribution. That is their nature. There is no hidden reasoning engine, no symbolic truth layer, no internal notion of correctness.

You can influence their behavior. You can constrain it. You can shape it. But you cannot turn probability into certainty.
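
A toy version of that sampling step looks like the sketch below. The three-word vocabulary and the scores are invented, and real inference stacks add many details that are left out here. The only purpose is to show where the randomness enters.

```python
import math
import random


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_next_token(vocab: list[str], logits: list[float],
                      temperature: float, rng: random.Random) -> str:
    """Draw one token according to the softmax probabilities."""
    probabilities = softmax(logits, temperature)
    return rng.choices(vocab, weights=probabilities, k=1)[0]


# Toy vocabulary and scores, invented for illustration.
vocab = ["yes", "no", "maybe"]
logits = [2.0, 1.5, 0.5]

rng = random.Random()  # unseeded on purpose: repeated runs can differ
samples = [sample_next_token(vocab, logits, temperature=1.0, rng=rng) for _ in range(10)]
print(samples)
```

Lowering the temperature concentrates probability on the top-scoring token, which is one way to shape behavior. The output is still a draw from a distribution, not a lookup.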

Somewhere between keynote stages, funding decks, and product demos, a comforting narrative emerged: models are getting cheaper and smarter, therefore AI will soon become trivial. The logic sounds reasonable. Token prices are dropping. Model quality is improving. Demos look impressive. From the outside, it feels like we are approaching a phase where AI becomes a solved commodity.

From the inside, it feels very different.

There is a massive gap between a good demo and a reliable product. A demo is usually a single prompt and a single model call. It looks magical. It sells. A product cannot live there. The moment you try to ship that architecture to real users, reality shows up fast. The model hallucinates. It partially answers. It ignores constraints. It produces something that sounds fluent but is subtly wrong. And the model has no idea it failed.

This is not a moral flaw. It is a design property.

So engineers do what engineers always do when a component is powerful but unreliable. They build structure around it.

The moment you care about reliability, your architecture stops being “call an LLM” and starts becoming a pipeline. Input is cleaned and normalized. A generation step produces a candidate answer. Another step evaluates that answer. A routing layer decides whether the answer is acceptable or if the system should try again. Sometimes it retries with a modified prompt. Sometimes with a different model. Sometimes with a corrective pass. Only after this loop does something reach the user.

At no point did the LLM become deterministic. What changed is that the system gained control loops.
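
The loop described above can be sketched in a few lines. The `generate` and `evaluate` callables are placeholders for whatever model calls or deterministic checks a real system would use, and the threshold and attempt limit are arbitrary assumptions. The stubs at the bottom exist only so the sketch runs without any model.

```python
from typing import Callable, Optional


def run_with_control_loop(
    user_input: str,
    generate: Callable[[str, int], str],
    evaluate: Callable[[str], float],
    threshold: float = 0.8,
    max_attempts: int = 3,
) -> tuple[Optional[str], list[dict]]:
    """Generate, evaluate, and retry until a candidate passes or attempts run out.

    Returns the accepted answer (or None) plus a trace of every attempt,
    so the behavior can be inspected afterward.
    """
    cleaned = " ".join(user_input.split())  # normalize whitespace
    trace = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(cleaned, attempt)
        score = evaluate(candidate)
        accepted = score >= threshold
        trace.append({"attempt": attempt, "score": score, "accepted": accepted})
        if accepted:
            return candidate, trace
    return None, trace  # the caller decides on the fallback


# Stubs standing in for real model calls.
def fake_generate(prompt: str, attempt: int) -> str:
    return f"draft {attempt} for: {prompt}"


def fake_evaluate(candidate: str) -> float:
    return 0.9 if candidate.startswith("draft 2") else 0.4


answer, trace = run_with_control_loop("  summarize   this ", fake_generate, fake_evaluate)
print(answer)
print(trace)
```

Every pass through the loop is another generation call plus another evaluation call. The generator inside it is exactly as probabilistic as before.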

This distinction matters. We are not converting probability into certainty. We are reducing uncertainty through redundancy and validation. That reduction costs computation. Computation costs money.

This is why quoting token prices in isolation is misleading. A single model call might be cheap. A serious system rarely uses a single call. One user request can trigger several model invocations: generation, evaluation, regeneration, formatting, tool calls, memory lookups. The user experiences “one answer.” The backend executes a small workflow.

Token cost is component cost. Reliable AI is system cost.

Saying “tokens are cheap, therefore AI is cheap” is like saying screws are cheap, therefore airplanes are cheap.
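
A back-of-the-envelope sketch of the difference, assuming a hypothetical workflow: the step names follow the list above, while the token counts, invocation counts, and price are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    tokens: int        # tokens consumed per invocation
    invocations: int   # how many times this step runs for one user request


def request_cost(steps: list[Step], price_per_1k_tokens: float) -> float:
    """Total model spend for one user request across every step."""
    total_tokens = sum(step.tokens * step.invocations for step in steps)
    return total_tokens / 1000 * price_per_1k_tokens


# Hypothetical workflow behind a single user-visible answer.
workflow = [
    Step("generation", tokens=1200, invocations=1),
    Step("evaluation", tokens=900, invocations=2),
    Step("regeneration", tokens=1200, invocations=1),
    Step("formatting", tokens=400, invocations=1),
]

single_call = request_cost(workflow[:1], price_per_1k_tokens=0.01)
full_workflow = request_cost(workflow, price_per_1k_tokens=0.01)

print(f"single call:   {single_call:.4f}")
print(f"full workflow: {full_workflow:.4f}")
```

With these made-up numbers, the request costs several times the price of the single generation call, and that is before tool calls, memory lookups, or infrastructure are counted.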

This leads to an uncomfortable but important truth. AI becomes expensive in two very different ways.

If you implement it poorly, it becomes expensive because you burn money and still do not get reliability. You keep tweaking prompts. You keep firefighting. You keep patching symptoms. Nothing stabilizes.

If you implement it well, it becomes expensive because you intentionally pay for control. You pay for evaluators. You pay for retries. You pay for observability. You pay for redundancy. But you get something in return: a system that behaves in a bounded, inspectable, and improvable way.

There is no cheap version of “reliable.”

Another source of confusion comes from mixing up different kinds of expertise. High-profile founders and executives are excellent at describing futures. They talk about where markets are going and what will be possible. That is their role. It is not their role to debug why an evaluator prompt leaks instructions or why a routing threshold oscillates under load. Financial success does not imply operational intimacy.

On the ground, building serious AI feels much closer to distributed systems engineering than to science fiction. You worry about data quality. You worry about regressions. You worry about latency and cost per request. You design schemas. You version prompts. You inspect traces. You run benchmarks. You tune thresholds. It is slow, unglamorous, and deeply technical.

LLMs made AI more accessible. They did not make serious AI simpler. They shifted complexity upward into systems.

So when someone says, “Soon we’ll just call an API and everything will work,” what they usually mean is, “Soon an enormous amount of engineering will be hidden behind that API.”

That is fine. That is progress.

But pretending that reliable AI is cheap, trivial, or solved is misleading.

The honest version is this: LLMs are powerful probabilistic components. Turning them into dependable products requires layers of control. Those layers cost money. They also create real value.

Serious AI today is expensive in the bad way if you do not know what you are doing.

Serious AI today is expensive in the good way if you actually want it to work.

And anyone selling “cheap deterministic AI” is selling a story, not a system.