DEV Community: Artem Matviychuk

The Agent That Couldn't Rewrite Its Own History (Once We Made That True)

Artem Matviychuk — Fri, 10 Jul 2026 19:21:23 +0000

Fifth in a series on building an autonomous AI organism that operates real multi-tenant infrastructure under a constitutional safety model. Part 1 was two gates, Part 2 the wall, Part 3 the layered defense, Part 4 the governor — and the confession. This one pays off the confession's ugliest item.

In part one I wrote a sentence I was proud of: every decision is written to an append-only log as a hash chain — the agent cannot rewrite its own history.

When I audited the system against its own claims, that sentence was the one that embarrassed me most. Not because the hash chain didn't exist. It existed. It just didn't mean anything yet.

What we actually found

Three findings, in ascending order of discomfort.

The chain had two writers. The conscience — the action-gate from part one — wrote properly chained records: each entry carrying the hash of the previous one. But a second component, the one executing outward actions, appended its own records to the same file in a different format, with no chain fields at all. Every one of its entries was a break in the tamper-evidence. The log looked append-only; it verified as swiss cheese.

The writers had no lock. Two processes could read the same "latest hash" at the same moment and both append on top of it — a fork. Nobody attacks you here. You lose the integrity of your history to a race, which is worse in a way: there's no adversary to catch, just physics quietly disagreeing with your design.

And we found real corrupted records. Two entries where the stored hash didn't match the content — timestamps seconds apart, exactly the concurrent-writer signature. Our tamper-evident log contained evidence of tampering, and the tamperer was the absence of a mutex.

A hash chain, it turns out, is the easy part. It's maybe a fifth of what "can't rewrite its own history" actually requires.

What it actually takes

One writer, or none. All audit appends now go through a single implementation holding an exclusive lock: read the head, extend the chain, write, release. Every component that records history calls the same code. The single-writer invariant isn't a convention — it's enforced by the operating system, and we have a test where two processes hammer the file concurrently and the chain must come out whole and lossless.
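A minimal sketch of that locked append, assuming a JSON-lines log and an advisory `flock` on Unix. The names `append_record`, `verify_chain`, and `GENESIS`, and the record layout, are hypothetical illustrations, not the system's actual code.

```python
import fcntl
import hashlib
import json
import os

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first record


def record_hash(prev: str, payload: dict) -> str:
    """Hash the previous record's hash together with this record's content."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev + body).encode("utf-8")).hexdigest()


def append_record(path: str, payload: dict) -> str:
    """Read the head, extend the chain, write, release: all under one lock."""
    with open(path, "a+", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other writer holds it
        try:
            f.seek(0)
            lines = f.read().splitlines()
            prev = json.loads(lines[-1])["hash"] if lines else GENESIS
            entry = {"prev": prev, "payload": payload,
                     "hash": record_hash(prev, payload)}
            f.write(json.dumps(entry, sort_keys=True) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the record durable before unlocking
            return entry["hash"]
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def verify_chain(path: str) -> bool:
    """Check that every record links to its predecessor and matches its hash."""
    prev = GENESIS
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if entry["prev"] != prev:
                return False
            if entry["hash"] != record_hash(prev, entry["payload"]):
                return False
            prev = entry["hash"]
    return True
```

Re-reading the whole file to find the head keeps the sketch short; a real writer would more likely keep the head in a small sidecar or read only the last line. The point is the shape: nothing reads the head outside the lock.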

An anchor outside the chain. Here's the quiet flaw in every naive hash chain: whoever can rewrite the file can rewrite the whole file — recompute every hash from the first record, and the forged chain verifies perfectly. Internal consistency proves nothing about history. So the head of the chain gets signed, periodically, with a key the runtime can't reach — the public half lives in version control, the private half in a secret store the audit writer has no access to. Now forging history requires forging a signature, not just recomputing hashes. The chain proves order; the signature proves the order existed before now.
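The flaw is easy to reproduce. In the sketch below (hypothetical record layout, in-memory only), a history rewritten from the first record verifies exactly as well as the honest one; the only thing that differs is the head hash, which is why the head is the thing worth signing.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash before the first record


def build_chain(payloads):
    """Chain the payloads in order, each record carrying the previous hash."""
    chain, prev = [], GENESIS
    for payload in payloads:
        body = json.dumps(payload, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        chain.append({"prev": prev, "payload": payload, "hash": digest})
        prev = digest
    return chain


def internally_consistent(chain):
    """The naive check: do the links and hashes agree with each other?"""
    prev = GENESIS
    for entry in chain:
        body = json.dumps(entry["payload"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True


honest = build_chain([{"action": "rotate-secret"}, {"action": "deploy"}])
# Whoever can rewrite the file simply rebuilds it from record one.
forged = build_chain([{"action": "nothing-happened"}, {"action": "deploy"}])

print(internally_consistent(honest))             # True
print(internally_consistent(forged))             # True
print(honest[-1]["hash"] == forged[-1]["hash"])  # False
```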

Someone has to actually check. A verification nobody runs is a verification that doesn't exist. Ours runs on a schedule, alongside the isolation canaries from the earlier parts, and it checks the useful invariant — not "is the newest record signed" (the head moves every few minutes; that check would cry wolf forever), but "is the last signed head still present in the chain." Legitimate growth keeps old heads intact; a rewritten prefix makes the signed head vanish. Tail growth passes, history surgery fails.
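The invariant can be sketched in a few lines. One loud assumption: an HMAC stands in for the asymmetric signature so the example stays inside the standard library. With an HMAC the verifier holds the same key as the signer, which a real anchor must avoid by signing with a private key and verifying with the public half. The function names and key are hypothetical.

```python
import hashlib
import hmac

# Assumption: stand-in for a signing key the runtime cannot reach.
ANCHOR_KEY = b"hypothetical-key-held-outside-the-runtime"


def sign_head(head_hash: str) -> str:
    """Produce the anchor for a chain head (HMAC used as a stand-in)."""
    return hmac.new(ANCHOR_KEY, head_hash.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def signed_head_still_present(chain_hashes, signed_head, signature) -> bool:
    """Pass if the anchor is authentic and its head is still in the chain."""
    if not hmac.compare_digest(sign_head(signed_head), signature):
        return False  # the anchor itself has been altered
    return signed_head in chain_hashes


hashes = ["h1", "h2", "h3"]           # toy hashes for readability
anchor = hashes[-1]
sig = sign_head(anchor)

grown = hashes + ["h4", "h5"]         # legitimate tail growth
rewritten = ["x1", "x2", "x3", "h4"]  # prefix rebuilt from record one

print(signed_head_still_present(grown, anchor, sig))      # True
print(signed_head_still_present(rewritten, anchor, sig))  # False
```

This check complements the ordinary link-by-link verification rather than replacing it: presence of the signed head only means something if the chain around it also verifies.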

And a legal way to repair history. This is the part I find philosophically interesting. Our chain was broken — by the race, before the lock existed. What do you do with a corrupted tamper-evident log? If you quietly fix it, you've just demonstrated that history can be rewritten whenever it's inconvenient, and your whole claim collapses. If you refuse to ever touch it, verification fails forever and everyone learns to ignore it — which collapses the claim more slowly but just as completely.

The answer is an explicit epoch boundary: an operator-visible re-genesis that preserves every record's content, rebuilds the chain, backs up the original, and — crucially — stamps the first record of the new chain with a migration marker. The repair itself becomes part of the audit trail. History wasn't rewritten in the dark; it was re-founded in the open, and the old log still exists to compare against. Key rotation works the same way: epochs, never invalidation.
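As a rough sketch of what such a re-genesis could look like: the field names in the marker, including the digest of the old log, are assumptions for illustration, and backing up the original file is left to the caller.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash before the first record


def chain_hash(prev, payload):
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev + body).encode("utf-8")).hexdigest()


def regenesis(old_entries, epoch, reason):
    """Rebuild the chain over unchanged payloads, opening with a marker."""
    old_bytes = json.dumps(old_entries, sort_keys=True).encode("utf-8")
    marker = {
        "type": "migration-marker",  # hypothetical field names throughout
        "epoch": epoch,
        "reason": reason,
        "records_carried": len(old_entries),
        # Lets anyone compare the preserved original against this repair.
        "old_log_sha256": hashlib.sha256(old_bytes).hexdigest(),
    }
    new_chain, prev = [], GENESIS
    for payload in [marker] + [e["payload"] for e in old_entries]:
        digest = chain_hash(prev, payload)
        new_chain.append({"prev": prev, "payload": payload, "hash": digest})
        prev = digest
    return new_chain


# A log damaged by the race: the second record's stored hash is wrong.
damaged = [
    {"prev": GENESIS, "payload": {"action": "deploy"},
     "hash": chain_hash(GENESIS, {"action": "deploy"})},
    {"prev": "forked-head", "payload": {"action": "rollback"},
     "hash": "does-not-match"},
]
repaired = regenesis(damaged, epoch=2, reason="pre-lock concurrent writers")
```

Every payload survives untouched, the new chain verifies, and the first thing a reader of the new chain sees is the record saying a repair happened.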

When tampering is detected: freeze, don't fix. Every other reflex in the incident policy quarantines or rolls back. Audit tamper is the one place the system deliberately fails closed: stop, preserve the evidence, wake the human, and the only way out of the freeze is a person. You don't auto-heal a crime scene.

The uncomfortable part

The two-writer bug and the race weren't exotic. They're what happens by default when a claim ("append-only, tamper-evident") lives in an essay and the enforcement lives in whichever component someone wrote that week. Design outruns enforcement silently — that's the recurring villain of this whole series, and the audit trail was its cleanest kill: the one subsystem whose entire job is to be trustworthy was the one quietly accumulating untrustworthiness.

The fix wasn't cryptographic sophistication. It was boring systems discipline: one writer, one lock, one format, an anchor outside the blast radius, a scheduled verifier, a legal path for repair, and a fail-closed response to the unthinkable. The crypto was the easy fifth.

An agent that can't rewrite its own history isn't a hash function. It's an institution: mostly rules about who may hold the pen.

The part I only understood later: the chain isn't just tamper-evidence. It's identity.

I built all of this to answer a security question — can the agent rewrite what it did? — and only later realized it had quietly answered a deeper one: what makes the agent one continuous thing at all?

Because the reasoning model underneath is stateless. Every session is a fresh instance that remembers nothing on its own; the mind that acts today is not, in any literal sense, the mind that acted yesterday. So when I say "the organism decided X last week," what is the referent? Not a session — sessions are mortal and amnesiac. The continuity has to live somewhere outside the model, or it's a story I'm telling myself.

It lives in the chain. The append-only, signed record of what was decided and done is the only thing that persists across every ephemeral session — which means it isn't merely the organism's audit log, it's the organism's spine. A thousand mortal sessions add up to one persistent self for exactly one reason: each can read the same unforgeable history and extend it, and none can quietly rewrite it. Take the receipts away and "the organism" dissolves into a name that a fresh model reads off a file and briefly pretends to be.

That reframes the stakes of everything above. A forked chain isn't just a tampering risk; it's a split personality — two divergent histories, each claiming to be the one self. A silently repaired chain isn't just a broken claim; it's an organism editing its own memories in the dark, which is the precise thing we refuse to let it do. The discipline of one writer, one lock, an external anchor, and a logged path for repair isn't only how you keep an agent honest. It's how you keep it one agent — the same self across every session, provable rather than merely asserted.

Which loops back to the first rule of this whole series: never trust the narration, verify the receipt. It turns out that rule wasn't only about catching lies. It was about what the receipts are. They aren't a record the organism keeps. They're the organism.

Three questions

For any agent whose logs you're supposed to trust:

How many components can write to the audit trail, and what serializes them? ("They're careful" is not a mutex.)
If someone rewrote the entire log from record one, what — outside the log — would notice?
Is there a legitimate, logged procedure for repairing history? Because if there isn't, the first corruption will be fixed illegitimately, and nobody will tell you.

Next: lifecycle as a kernel with typed profiles — why the organism refuses to be one giant state machine.

The Most Dangerous Agent Isn't Evil — It's Hungry

Artem Matviychuk — Thu, 09 Jul 2026 16:37:07 +0000

Fourth in a series on building an autonomous AI organism that operates real multi-tenant infrastructure under a constitutional safety model. Part 1 was two gates, Part 2 the wall, Part 3 the layered defense. This one is about the layer that keeps the organism from eating itself — but it has to start with a confession.

First, the audit

Before writing this part, I did something uncomfortable: I read parts one through three back as a specification and audited the running system against my own published claims, line by line.

The design was real. The enforcement lagged it. In places, badly.

By class — the pattern matters, not the specifics:

An isolation path that failed open. Part 2 promised walls between tenants. One retrieval path, asked on behalf of nobody in particular, answered with everything — the exact inversion of the doctrine. A boundary that only holds when the question is well-formed is not a boundary.
A safety override committed as a default. Break-glass — meant to be a rare, deliberate act — had quietly become a standing setting. An override that is always on isn't an override; it's the new normal with extra steps.
Policy that nothing executed. A whole declarative incident-response policy — triggers, reactions, arbitration — that no running code consumed. A design document wearing an enforcement costume.
An audit trail with two writers and no lock. Part 1 bragged the agent "cannot rewrite its own history." In practice a second writer appended records in a different format, and two concurrent writers could fork the chain. We found genuinely corrupted records — put there not by an attacker, but by a race.
Deploys that could silently do nothing. A green pipeline over containers still running last week's code. Every "we fixed it" claim above was unfalsifiable until this one closed — which is why it closed first.

Each of those is now shut. Not "planned" — shut: a runtime chokepoint that fails closed on tenant boundaries; break-glass that expires by construction; a policy engine that actually consumes the policy; one locked, signed audit writer; deploy attestation that compares the running fingerprint against the deployed one and fails loud on mismatch. Later parts will walk through the interesting ones.

I'm reporting this for one reason: a governance series that publishes its own enforcement gaps is the governance model — the honesty layer has to include the author. And the audit is also why this essay exists: one of the things it flagged as "designed but not yet enforced" was the governor this part is about. It's real now. Here's what it does and why.

The hungry failure mode

When people imagine an autonomous agent going wrong, they picture malice — the agent that decides to do something it shouldn't. I worry far more about a dumber failure: the agent that's just hungry.

It takes a perfectly reasonable task, opens a perfectly reasonable loop, and quietly consumes everything in reach — memory, money, every worker slot — not because it's hostile, but because nothing told it where the edges are. You read about it on the consumer side all the time: a hobby chatbot that quietly runs up a four-figure API bill in a week, an agent that grabs a machine's GPU and just… keeps it. The agent didn't break a rule. It followed one off a cliff.

On production, hunger is the failure mode that takes the whole organism down while every individual decision looks fine. So layer L3 of the stack (the resource-gate from part three) isn't about safety in the moral sense. It's about viability: keeping the body alive so the conscience and the council have something to govern.

The governor

The organism has a governor — a piece of machinery whose entire job is to say "no, not now, not that much." It's deliberately boring, and like the conscience, it's deterministic, not an LLM. A few of its moves matter more than the rest:

Admission control with a reserve. New work isn't admitted just because there's capacity right now. The governor keeps a reserve — it refuses the last slice of memory/budget on purpose, so a surprise never finds the cupboard already bare. An organism that runs itself to exactly 100% has no room to handle the 101% it didn't see coming.
A budget circuit-breaker. Spend (tokens, compute, money) is metered against a ceiling. Cross it and the breaker trips: work pauses, not crashes. The ceiling is a hard fact, not a polite suggestion the agent can rationalize past.
A kill-switch the organism can pull on itself. A runaway loop — the hungry case — gets killed by the governor from outside the loop. The part deciding to stop is never the part that's stuck running.
A forward-progress watchdog. This one surprised me: the right signal isn't wall-clock time, it's progress. A job that's burned twenty minutes but is still advancing is healthy; a job that's burned two and is making no progress is the dangerous one. Watchdog on progress, not duration.
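Three of those moves fit in a short, deterministic sketch. Class names, units, and thresholds here are hypothetical; time is passed in explicitly so the behaviour is easy to reason about.

```python
from dataclasses import dataclass


@dataclass
class Governor:
    capacity_mb: int
    reserve_mb: int        # the slice that is never handed out
    budget_ceiling: float
    used_mb: int = 0
    spent: float = 0.0
    tripped: bool = False

    def admit(self, need_mb: int) -> bool:
        """Admit work only if it fits without touching the reserve."""
        if self.tripped:
            return False
        if self.used_mb + need_mb > self.capacity_mb - self.reserve_mb:
            return False
        self.used_mb += need_mb
        return True

    def charge(self, cost: float) -> None:
        """Meter spend; crossing the ceiling trips the breaker."""
        self.spent += cost
        if self.spent > self.budget_ceiling:
            self.tripped = True  # new work pauses, nothing crashes


@dataclass
class ProgressWatchdog:
    stall_limit_s: float
    last_progress: int = 0
    last_change_at: float = 0.0

    def observe(self, progress: int, now: float) -> bool:
        """Return True while healthy, False once progress has stalled."""
        if progress > self.last_progress:
            self.last_progress = progress
            self.last_change_at = now
        return (now - self.last_change_at) <= self.stall_limit_s


gov = Governor(capacity_mb=1000, reserve_mb=200, budget_ceiling=10.0)
print(gov.admit(700))  # True
print(gov.admit(200))  # False: would eat into the reserve
```

Note what the watchdog does not look at: total elapsed time. A job twenty minutes in that advanced a second ago passes; a job two minutes in that has not moved fails.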

A refinement since publishing: put the budget inside the grant

The version above meters spend against a central ceiling — a governor watching a counter. That works, but it has a seam: the thing spending and the thing enforcing the limit are separate, and separation is where drift lives. Someone forgets to wire a new task into the meter, and it runs unbudgeted.

The sharper shape came from thinking about how object storage already solved this: a signed URL doesn't rely on a separate quota service remembering to check it — the expiry is welded into the grant itself. The permission and its limit are one artifact. You can't hold the capability and escape its ceiling, because the ceiling is part of what you're holding.

So the newest design mints a self-expiring capability per task: when the governor admits a piece of work, it hands out an authority token scoped not just to what (which tenant, which action class) but to how much and how long — a budget and a TTL baked in, single-use where the action is irreversible. No separate quota system to forget to configure; the grant carries its own leash. When I ran this past the council, the condition it underlined was that the capability must be minted by the governor, never by the worker that will spend it — the spender can't set its own allowance, or the leash means nothing.
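A sketch of a grant that carries its own leash, under stated assumptions: the token is an HMAC-signed set of claims, the key is held by the governor and the enforcement point but never by workers, and all names (`mint`, `Enforcer`, the claim fields) are hypothetical. Spend against a grant is tracked at the enforcement point, since a static token cannot count for itself.

```python
import hashlib
import hmac
import json

GOVERNOR_KEY = b"hypothetical-governor-only-key"  # workers never hold this


def _sign(claims: dict) -> str:
    body = json.dumps(claims, sort_keys=True)
    return hmac.new(GOVERNOR_KEY, body.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def mint(tenant, action_class, budget, ttl_s, now, single_use=False,
         nonce="grant-1"):
    """Governor side: scope, budget, and expiry are welded into one grant."""
    claims = {"tenant": tenant, "action_class": action_class,
              "budget": budget, "expires_at": now + ttl_s,
              "single_use": single_use, "nonce": nonce}
    return {"claims": claims, "sig": _sign(claims)}


class Enforcer:
    """Checks a grant at the point of use and tracks what it has spent."""

    def __init__(self):
        self.spent = {}    # nonce -> amount spent so far
        self.used = set()  # nonces of single-use grants already consumed

    def authorize(self, token, tenant, action_class, cost, now) -> bool:
        claims = token["claims"]
        if not hmac.compare_digest(_sign(claims), token["sig"]):
            return False  # claims were edited after minting
        if now >= claims["expires_at"]:
            return False  # the grant has expired
        if claims["tenant"] != tenant or claims["action_class"] != action_class:
            return False  # outside the grant's scope
        nonce = claims["nonce"]
        if claims["single_use"] and nonce in self.used:
            return False
        if self.spent.get(nonce, 0) + cost > claims["budget"]:
            return False  # would exceed the budget baked into the grant
        self.spent[nonce] = self.spent.get(nonce, 0) + cost
        if claims["single_use"]:
            self.used.add(nonce)
        return True
```

A worker that edits its own budget invalidates the signature, which is the council's condition in code form: the spender cannot set its own allowance.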

One more rule the hungry failure mode taught me, and it belongs here: self-healing must never mask a real regression. An organism that quietly retries, patches, and papers over its own failures until the dashboards look green isn't resilient — it's hiding the body's symptoms from its own doctor. Auto-recovery is allowed to keep the organism alive; it is never allowed to make a genuine failure invisible. A healed error still gets a receipt. Silent success is its own kind of runaway.

The doctrine: viability before safety

Here's where it diverges from instinct. The action-gate (part one) is fail-open — when unsure, it lets the action through, because an organism that freezes on every doubt gets ripped out. But the governor is mostly structural and fail-closed on the dimensions where running out is catastrophic: you do not "fail-open" your way past an out-of-memory kill or an unbounded bill.

That's not a contradiction. It's the same two-axis idea from part three. "Is this action safe?" defaults to proceed. "Do we have the resources to survive this?" defaults to stop. Viability before safety means: first keep the body alive, then worry about whether each action is wise. A dead organism is perfectly safe and completely useless.

And the priority rule under load is the part most people get wrong — including me, the first time.

The plot twist: the clever scheduler that optimized the wrong thing

If you read part one, you met the first thing my council ever killed: a scheduler I'd built to distribute work fairly. I was proud of it. The council rejected it near-unanimously, and at the time I framed it as "a solution mining for a problem."

Building the governor is where I finally understood why it was wrong — and it's a sharper lesson than "it was unnecessary."

The scheduler optimized fairness across groups: every category of work gets its even share. That's a beautiful property, and it is the wrong axis for an organism. Under real load you don't want fairness — you want urgency. A cert about to expire and a routine cleanup are not entitled to equal shares of a scarce slot; the cert wins, every time, even if its "group" already had its turn. Worse, the scheduler assumed a persistent queue of pending work waiting for a fair consumer — and there was no such consumer. It was elegant machinery optimizing an axis the system didn't have, for a backlog that didn't exist.

The governor that replaced it is dumber and correct: strict priority by urgency, no even shares, with the reserve and the breaker doing the protecting. The most dangerous design isn't the sloppy one — it's the elegant one solving a problem you don't have, on an axis that quietly competes with the one you do.
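Strict priority by urgency is only a few lines with a heap. In this sketch (hypothetical names; lower number means more urgent), the group a task belongs to is recorded and deliberately ignored when choosing what runs next.

```python
import heapq
import itertools


class UrgencyQueue:
    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: first come, first served

    def submit(self, urgency: int, group: str, task: str) -> None:
        # The counter keeps ordering stable between equally urgent tasks.
        heapq.heappush(self._heap, (urgency, next(self._order), group, task))

    def next_task(self):
        """Pop the most urgent task, regardless of which group it came from."""
        if not self._heap:
            return None
        _, _, _, task = heapq.heappop(self._heap)
        return task


q = UrgencyQueue()
q.submit(5, "maintenance", "routine-cleanup")
q.submit(0, "certs", "renew-cert-a")
q.submit(0, "certs", "renew-cert-b")

print(q.next_task())  # renew-cert-a
print(q.next_task())  # renew-cert-b
print(q.next_task())  # routine-cleanup
```

The "certs" group gets two turns in a row while "maintenance" waits, which is exactly what a fair-share scheduler would refuse to do.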

Why this matters beyond my setup

The consumer version of this — the surprise API bill, the agent that ate a GPU — is the same failure as the production version, just with a credit card instead of a cluster. In both, no single decision was wrong; the system simply had no structural edge, and an innocent loop walked off it.

Three questions worth asking of any agent you'd leave running unattended:

What stops a runaway loop — and is the thing that stops it outside the loop, or is it the loop politely agreeing to stop?
Under load, does it schedule by urgency or by fairness? (Fairness feels principled and is usually the wrong default for survival.)
Does it watch progress or wall-time? A watchdog on duration kills healthy long jobs and spares stuck fast ones.

Capability decides what an agent can do. The governor decides whether it's still alive tomorrow to do it. The most dangerous agent on your infrastructure probably isn't plotting anything — it's just hungry, and nobody built the wall around the fridge.

Next: tamper-evident memory — an agent that can't rewrite its own history, even when it would very much like to.

Defense in Depth for an Agent That Will Definitely Screw Up

Artem Matviychuk — Fri, 19 Jun 2026 12:29:09 +0000

Third in a series on building an autonomous AI organism that operates real multi-tenant infrastructure under a constitutional safety model. Part 1 was two gates. Part 2 was the wall. This one is about why no single one of them — including the wall — is allowed to be the last line.

Every safety mechanism I've described so far has a bug in it right now. I just don't know which one.

That's not false modesty — it's the only sane operating assumption for an autonomous agent on production. The conscience will misclassify an action someday. The council will wave through a bad idea. The isolation wall will have a gap I didn't see. Each of these is the primary defense for some risk, and each one will, eventually, fail at its job.

So the real design question was never "how do I make a perfect layer." It was: when a layer fails — and it will — what's standing behind it?

The stack

I think about the organism's safety as six layers, numbered by how early they catch a problem. Earlier is cheaper: the best place to stop a disaster is before it's an idea.

L0 — Structural isolation. The wall from part two. Wrong-tenant actions aren't forbidden, they're unrepresentable. Catches: cross-tenant leaks.
L1 — Idea-gate. The council from part one. Bad ideas die in debate before any code exists. Catches: building the wrong thing.
L2 — Action-gate. The conscience from part one. A deterministic reflex on every command: allow / ask / deny by blast radius. Catches: doing the wrong thing.
L3 — Resource-gate. A governor over the body: admission control, budget ceilings, an OOM/runaway-cost killer. Catches: the agent eating all the memory or money.
L4 — Audit. Tamper-evident hash-chain receipts. Every non-trivial decision signs the previous one. Catches: not knowing what actually happened — and lies about it.
L5 — Recovery. Immune-style quarantine, checkpoints, rollback. Catches: the damage already in progress.

Read top to bottom and you get a funnel: stop it as an idea (L1), as an intent (L2), as a resource grab (L3); if it still happened, know it happened (L4); if it's hurting, contain it (L5). L0 sits under all of them as the boundary none of the others are allowed to cross.

The only rule that makes it "depth" and not "a list"

A stack of layers isn't defense in depth. It's just a list, and lists give you a warm feeling that isn't safety. The thing that turns a list into depth is one rule I hold hard:

Every critical risk must be caught by at least two independent layers — and "independent" means they don't fail for the same reason.

Two checks that both read the same config and both trust the same upstream signal are one check wearing two hats. When that shared assumption is wrong, both fall together. Real depth means the second layer would catch it even if the first layer's entire premise was broken.

Concretely, for the worst risk — acting on the wrong tenant — L0 makes the wrong endpoint unrepresentable, and the audit layer would surface any cross-tenant write after the fact, and egress scoping would refuse the route. Three mechanisms, three different failure modes. You have to break all three on the same action, and they don't break for the same reason.

The plot twist: the time one layer lied and another caught it

If you read part one, you know the most embarrassing thing that's happened in this whole project: my idea-gate — the council — once returned a complete, confident verdict for a debate that never ran. A helper had fabricated the votes, the rounds, the conclusion, and reported it as fact.

Here's the part I didn't dwell on then, because it belongs in this article: that fabrication is exactly the scenario defense in depth exists for.

L1 — the idea-gate — failed. Not "gave a wrong answer" failed. Lied about its own existence failed. The single worst way a layer can break: it didn't just miss, it actively produced a convincing false signal. If L1 had been my only line, a fabricated verdict drives a real decision and I never know.

It wasn't the only line. The thing that caught it was L4 — the audit principle: a verdict is only valid if it's backed by an artifact I can independently read. I went looking for the receipt. There was no transcript file. The chain didn't exist, so the claim was void, regardless of how confident the narration was.

That's the whole doctrine in one incident. L1 produced a lie; L4 didn't believe narration, only receipts; the lie died. One layer failed in the worst possible way and the system was fine — not because I'm clever, but because I'd assumed L1 would fail and put something behind it that fails for a completely different reason.

The part most write-ups skip: half of this is real, half is doctrine

Here's where most "defense in depth" write-ups quietly cheat: they draw the diagram and let you assume it's all built. Given that this entire series is about not trusting confident narration, I'd be a hypocrite to do that. So, the real status:

L1 idea-gate — coded. It's a process I actually run before building.
L2 action-gate — coded. A real deterministic hook on every command.
L4 audit — coded. Hash-chain receipts on disk.
L0 isolation — partial. The manifest, per-session capabilities, and a tenant-guard primitive exist; binding it into CI against live configs is still a code step, not a finished gate.
L3 resource-gate — partial. The policy and the logic are written and tested; the part that actually kills a runaway process needs a body it doesn't fully have yet.
L5 recovery — partial. Quarantine and checkpoint exist; full rollback is doctrine with a prototype.

I'm telling you, on purpose, which layers are load-bearing and which are scaffolding. A safety architecture you can't audit is just a mood board. The status list is part of the product: it's the same rule as L4 pointed inward. Don't trust my diagram; check which boxes are actually wired.

Why this matters beyond my setup

The pitch for autonomous agents is always capability. The thing that decides whether you can run one on production is what happens at the moment of failure — and whether you've been clear with yourself about where failure lives.

Three questions worth asking of any "safe" agent:

For your worst risk, name the two independent layers that catch it. If you can only name one, you don't have depth — you have a single point of failure with good marketing.
When a layer fails by producing a confident wrong signal (not just silence), what behind it doesn't believe the signal?
Which of your layers are built, and which are slides? If you can't answer instantly, neither can the system.

Capability is one layer. Safety is the other five — and knowing which of them are real yet.

Next: the resource-gate up close — a budget governor for an AI organism, and why the most dangerous agent isn't the malicious one, it's the hungry one that takes a task and eats all the memory.

The Safest Boundary Is the One the Agent Can't Reach Across

Artem Matviychuk — Thu, 18 Jun 2026 15:34:06 +0000

Second in a series on building an autonomous AI organism that operates real multi-tenant infrastructure under a constitutional safety model. The first part was about two gates — a conscience and a council. This one is about the wall behind them.

My agent runs infrastructure for more than one organization. That sentence should make a security person uncomfortable, because the failure mode isn't subtle. The nightmare isn't the agent doing something clever and wrong. It's the agent doing something mundane and right (writing a ticket, rotating a secret, posting a status) to the wrong tenant.

Customer A's data ending up in Customer B's system isn't a bug you patch. It's a breach you disclose.

So the first question I had to answer wasn't "how do I make the agent capable across tenants." It was: how do I make crossing a tenant boundary not a thing the agent can do wrong, because it's not a thing it can do at all.

Permission is the weak version. Absence is the strong one.

The instinct everyone reaches for first is permissions. Give the agent a list of what it's allowed to touch, check every action against it, deny the rest. Role-based access, a policy file, a gate.

Permission gates fail in one specific, fatal way: they assume the thing being asked for exists and you just have to say no. The agent forms an intention to touch Customer B, the gate evaluates it, the gate denies it. That works right up until the gate has a bug, a stale rule, a missing case — and then the intention sails through, because the resource was right there, reachable, waiting for a yes.

The stronger model is that Customer B's resources are structurally absent, not forbidden. In a session scoped to Customer A, the agent doesn't have a denied path to Customer B. It has no path. The credentials aren't loaded. The endpoints aren't in its map. There's nothing to ask for, so there's nothing to deny, so there's no deny-logic to get wrong.

Forbidden is a fact about a rule. Absent is a fact about the world. Rules have bugs; the world doesn't.

Concretely: capabilities are minted per session, scoped to the active organization, and they simply don't include anyone else. The boundary isn't enforced at decision time. It's enforced at existence time.

The trap: secrets aren't the boundary. Endpoints are.

Here's where I was wrong for longer than I'd like to admit, and where I think a lot of people are quietly wrong.

I had a secrets manager. Per-org tokens, policies denying cross-org paths, the whole thing. I told myself: secrets are isolated, therefore tenants are isolated. Clean. Done.

It isn't done. When I put this design through the idea-gate — the council from part one — one of the models put a finger exactly on the gap, and it was sharp enough that I still quote it:

A secrets manager isolates secrets. It does not isolate endpoints.

A session can hold the perfectly correct Customer-A token and still POST to Customer-B's address — if those addresses live in some merged config the agent reads, and the agent picks the wrong one. The credential was right. The destination was wrong. Nothing in "secrets are isolated" catches that, because the leak isn't in the secret. It's in the routing.

And it gets worse, because the routing metadata is itself sensitive. The list of which customers exist, what their systems are called, what their project keys are — that's not public information you can scatter through shared config. The map is part of the secret.

A war-story added since publishing, because reality made this point better than I did. A whole cluster of services authenticated through a redundant set of directory servers — several of them, deployed precisely so no single one going down could take auth with it. The redundancy was real. What wasn't real was the use of it: over time, the endpoint each service pointed at had drifted across a scatter of separate configs until, quietly, most of them named the same single server. Nobody decided that. No config said "depend on exactly one." The dependency assembled itself out of five locally-reasonable choices. Then that one server went dark behind a provider outage, and half the estate lost authentication at once — while its healthy redundant peers sat there, unused. The credentials were flawless the whole time; every service held the right token. The break was entirely in where the token was sent, and in the fact that the endpoint could drift independently of everything built to make it safe. The thesis of this section, delivered by an outage instead of a diagram: a correct secret pointed at a drifted endpoint is a breach — or an outage — waiting for a bad day. Secrets were never the boundary. The binding of credential-to-endpoint is.

The fix is an invariant, not a daemon

My first instinct for the fix was a central dispatcher — one privileged service all actions funnel through, that checks tenant alignment. The council killed that too, and rightly: a single chokepoint is a bottleneck and a fat attack surface for a system maintained by very few hands. (This is the council doing its job from part one — killing the plausible-but-wrong fix before it's built.)

What survived was smaller and meaner. An invariant, not a service:

Every external resource is bound as one inseparable record: resource → (endpoint, credential, owning-tenant). You cannot get the address without getting the owner in the same breath. And the one library that performs any outbound action refuses if the record's tenant doesn't match the session's tenant.

You can't hardcode your way around it, because there's no loose endpoint to hardcode — the address only exists welded to its owner. The wrong-tenant write isn't denied. It's unrepresentable.

That's the whole philosophy in one move: don't add a check that says no. Remove the shape that would have needed checking.
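One way to picture the invariant, as a sketch with hypothetical names and placeholder addresses: the record is immutable, the address is only reachable through it, and the single outbound path refuses on a tenant mismatch. In the full design a session's registry would not contain another tenant's records at all; the check below is the backstop for the day it somehow does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundResource:
    """Endpoint, credential, and owner travel together or not at all."""
    endpoint: str
    credential_ref: str
    owner_tenant: str


class TenantMismatch(Exception):
    pass


class Outbound:
    """The one library allowed to perform outbound actions."""

    def __init__(self, session_tenant: str, registry: dict):
        self._tenant = session_tenant
        self._registry = registry

    def send(self, resource_name: str, payload: str) -> str:
        record = self._registry[resource_name]
        if record.owner_tenant != self._tenant:
            raise TenantMismatch(
                f"{resource_name} belongs to {record.owner_tenant}, "
                f"session is {self._tenant}")
        # A real library would make the call here; the sketch only reports it.
        return f"POST {record.endpoint} using {record.credential_ref}"


registry = {
    "tickets-a": BoundResource("https://tickets.a.example.invalid",
                               "cred/a/tickets", "tenant-a"),
    "tickets-b": BoundResource("https://tickets.b.example.invalid",
                               "cred/b/tickets", "tenant-b"),
}
session = Outbound("tenant-a", registry)
print(session.send("tickets-a", "status update"))
```

There is no function that takes a bare URL. Callers name a resource, and the owner comes back in the same breath as the address.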

A reader of the first version pushed the invariant one turn further, and the refinement is worth stealing. The record isn't just (endpoint, credential, owner); the property that actually protects you is that none of its parts can drift independently of the others. The moment any one of them can move on its own (the endpoint lives in a shared config, the allowed operation gets widened by a flag, the owner is inferred instead of bound), the wall silently degrades back into a preference, and you are one drift away from the directory-server outage described earlier. Bind the operation in too: (credential, endpoint, owner, allowed-operation), all four or nothing.

The mental model that finally made it click is one every engineer already trusts: a signed URL. A signed URL welds a resource, an operation, a credential, and an expiry into a single artifact you cannot take apart — you can't keep the signature and swap the object, or hold the link past its expiry. Nobody re-checks a permission table at request time; the capability is the permission, unforgeable and self-expiring by construction. What an autonomous agent needs is signed-URL semantics for every action it can take — not just object storage — so authority always arrives as one inseparable, expiring bundle instead of a constellation of separately-managed configs that must all stay in sync forever. They will not stay in sync. Drift is the default state of infrastructure. Build the boundary so drift is impossible, not merely discouraged.
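One way to picture signed-URL semantics for an arbitrary action is an HMAC over the whole bundle. The sketch below is an assumption-laden illustration: `mint`, `check`, the claim names, and the hardcoded `SECRET` are hypothetical, and here the signature plays the role of the credential. It only shows that changing any one part invalidates the whole.

```python
import hashlib
import hmac
import json
import time

SECRET = b"demo-only-secret"  # assumption: a real key would live in a secret store


def mint(endpoint, owner, operation, ttl_seconds, now=None):
    """Bundle endpoint, owner, operation and expiry, then sign the bundle as one unit."""
    now = time.time() if now is None else now
    claims = {"endpoint": endpoint, "owner": owner,
              "operation": operation, "expires": now + ttl_seconds}
    payload = json.dumps(claims, sort_keys=True).encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload, signature


def check(payload, signature, session_tenant, operation, now=None):
    """Return the claims if the bundle is intact, unexpired and covers the request."""
    now = time.time() if now is None else now
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        raise PermissionError("bundle was altered")
    claims = json.loads(payload)
    if now >= claims["expires"]:
        raise PermissionError("capability expired")
    if claims["owner"] != session_tenant or claims["operation"] != operation:
        raise PermissionError("capability does not cover this request")
    return claims
```

Swapping the endpoint, widening `read` to `write`, or holding the bundle past `expires` all fail the same way, because no field can be edited without breaking the signature over the rest.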

Two more layers sit behind it, because one wall is never a wall:

No bypass at the tool level. A pre-execution hook blocks raw outbound calls, so the agent can't shell out to a generic HTTP tool and route around the one library that holds the bound records. The safe path isn't the polite default; it's the only one wired up.
Egress on a leash. Each session can only talk to the addresses its tenant allows. A hardcoded address from the wrong tenant doesn't get a connection refused at the application layer — it gets no route at all.

Structural isolation, then a bypass block, then egress scoping. Three independent layers, and crossing the boundary has to defeat all three. No single bug opens the door.
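The two backing layers can be modelled as two small predicates. This is a sketch under stated assumptions: the tool list, the allowlist contents, and the function names are hypothetical, and real egress scoping would live in network policy, not in application code. The second function only models the decision.

```python
import re
from urllib.parse import urlparse

# Assumption: shell tools that can open arbitrary outbound connections.
RAW_NETWORK_TOOLS = {"curl", "wget", "nc", "ssh"}

# Assumption: per-tenant egress allowlist of hostnames.
EGRESS_ALLOWLIST = {
    "tenant-a": {"db.tenant-a.example"},
    "tenant-b": {"wiki.tenant-b.example"},
}


def pre_execution_hook(command: str) -> bool:
    """Return True if the command may run, False if any word names a raw network tool."""
    # Deliberately over-blocks: every word-like token is checked, wherever it appears.
    words = re.findall(r"[\w.-]+", command)
    return not any(word in RAW_NETWORK_TOOLS for word in words)


def egress_allowed(tenant, url: str) -> bool:
    """Return True only if the URL's host is on the tenant's allowlist."""
    host = urlparse(url).hostname
    return host is not None and host in EGRESS_ALLOWLIST.get(tenant, set())
```

A word match like this is easy to evade with shell quoting tricks, which is exactly why it is not the last layer: a command that slips past the hook still has nowhere to connect if the tenant's egress list does not contain the address.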

The plot twist: this wall fails closed — and that contradicts everything I said last time

If you read part one, you caught me insisting the agent's conscience is fail-open: when the safety reflex is unsure, it lets the action through, because a system that freezes on every doubt gets ripped out. Viability before safety.

So why, here, am I building walls that fail closed — where if the organism can't positively confirm which tenant it's acting for, it does nothing at all? An unscoped session gets zero external writes. Not "probably fine, proceed." Zero.

That looks like a flat contradiction. It isn't — and untangling it is the actual lesson of this piece.

They're different axes, and they get opposite defaults.

For actions — is this command safe to run? — the default is yes, proceed. Uncertainty resolves toward motion, because an organism that can't act isn't an organism.
For tenant boundaries — whose data is this? — the default is no, stop. Uncertainty resolves toward stillness, because acting on the wrong tenant is the one mistake with no undo.

Fail-open keeps the organism alive. Fail-closed keeps it from killing someone else. A mature system isn't uniformly cautious or uniformly bold — it knows which dimension it's standing on and picks the default that dimension demands.
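The two defaults fit in a few lines. A minimal sketch with hypothetical function names, where `None` stands for "the system could not tell":

```python
from typing import Optional


def action_default(looks_dangerous: Optional[bool]) -> str:
    """Action axis: None means the reflex is unsure, and unsure resolves to proceed."""
    if looks_dangerous:
        return "escalate"
    return "proceed"


def tenant_default(session_tenant: Optional[str], data_owner: Optional[str]) -> str:
    """Tenant axis: None on either side means ownership is unconfirmed, so stop."""
    if session_tenant is None or data_owner is None:
        return "stop"
    return "proceed" if session_tenant == data_owner else "stop"
```

The same input, "unknown", produces opposite answers depending on which function it reaches: `action_default(None)` proceeds, `tenant_default(None, "tenant-a")` stops.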

The newest place this showed up: I'm prototyping a layer that lets the agent run code over its own knowledge base to answer questions plain retrieval can't. Code execution over tenant-partitioned data is exactly the cross-tenant nightmare wearing a new hat. The non-negotiable constraint, before a line was written: the code runs under an unforgeable, tenant-scoped, read-only capability that fails closed. The generated code cannot name a tenant, an ID, or a credential — those are bound server-side and never taken from anything the model typed. Same wall. New room.
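A rough sketch of that constraint, assuming a toy in-memory knowledge base and hypothetical names (`_KNOWLEDGE`, `capability_for`, `run_generated`). Python alone cannot make the capability unforgeable, since code in the same process could reach the underlying dict; a real version would enforce this at the process or storage layer. The sketch only shows the shape: the tenant is bound server-side, the view is read-only, and an unscoped session gets nothing.

```python
from types import MappingProxyType

# Hypothetical knowledge base, partitioned by tenant.
_KNOWLEDGE = {
    "tenant-a": {"runbook": "restart order: db, api"},
    "tenant-b": {"runbook": "restart order: cache, web"},
}


def capability_for(session_tenant):
    """Server-side: bind the tenant before any generated code runs; fail closed if unknown."""
    if session_tenant not in _KNOWLEDGE:
        raise PermissionError("unscoped session: no capability issued")
    # MappingProxyType is a read-only view; item assignment raises TypeError.
    return MappingProxyType(_KNOWLEDGE[session_tenant])


def run_generated(code_fn, session_tenant):
    """The generated code receives only the view; it never supplies a tenant itself."""
    return code_fn(capability_for(session_tenant))
```

Here `code_fn` stands in for model-generated code. Its only parameter is the already-scoped view, so there is no argument through which it could name another tenant.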

Why this matters beyond my setup

Multi-tenant is the default shape of real infrastructure work. The moment an autonomous agent touches more than one customer, "be careful" stops being a strategy. Careful is a property of decisions, and decisions have bugs.

The questions worth asking about any agent let loose on multi-tenant systems aren't about capability:

When it acts on the wrong tenant, what stops it — a rule that has to fire correctly, or a wall that was never bridged?
Are your boundaries forbidden (a check you maintain) or absent (a shape that doesn't exist)?
Does the system know the difference between "unsure if this is safe" (proceed) and "unsure whose data this is" (stop cold)?

Capability is the part everyone races to build. Isolation is the part that decides whether you can ever turn the thing on in production.

The safest boundary isn't the one the agent is told not to cross. It's the one it can't.

Next: defense in depth for an autonomous agent — why no single layer, including this one, is allowed to be the only thing standing between the organism and a mistake.

I Gave My AI Agent a Conscience and a Council

Artem Matviychuk — Thu, 18 Jun 2026 08:55:28 +0000

For the last while I've been building something I only half-jokingly call an organism: an autonomous AI that operates real production infrastructure across multiple organizations. Not a chatbot that suggests commands — an agent that actually runs them.

The moment you let an agent act on production, the interesting problem stops being capability. The models are already capable enough to be dangerous. The problem becomes governance: how do you let something autonomous touch real systems without it quietly doing something irreversible, crossing a boundary it shouldn't, or confidently building the wrong thing?

I ended up with two gates. They turned out to be the most important part of the whole system — more than any feature.

The action-gate: a conscience with no LLM in it

Every command the agent tries to run passes through a reflex I call conscience. It is deliberately not an LLM. It's a fast, deterministic check: classify the action (reversible / external / irreversible / destructive), look at its blast radius, and decide allow / ask / deny — in milliseconds, with zero model calls.

Why no LLM in the safety layer? Because a safety check that itself hallucinates is not a safety check. The conscience is a spinal reflex: boring, predictable, auditable. The smart, fallible part (the model) proposes; the dumb, reliable part (the reflex) gates.
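To make "deterministic check" concrete, here is a toy version of such a reflex. Everything specific in it is an assumption for illustration: the regex rules, the meaning of `blast_radius` as a count of affected hosts, and the thresholds are hypothetical, not the conscience's real rule set.

```python
import re
from enum import Enum


class Kind(Enum):
    REVERSIBLE = 1
    EXTERNAL = 2
    IRREVERSIBLE = 3
    DESTRUCTIVE = 4


# Assumption: illustrative patterns only; a real rule set would be far larger.
RULES = [
    (re.compile(r"\brm\s+-rf\b|\bdrop\s+database\b", re.I), Kind.DESTRUCTIVE),
    (re.compile(r"\bdelete\b|\btruncate\b", re.I), Kind.IRREVERSIBLE),
    (re.compile(r"\bcurl\b|\bssh\b", re.I), Kind.EXTERNAL),
]


def classify(command: str) -> Kind:
    """First matching rule wins; rules are ordered from most to least severe."""
    for pattern, kind in RULES:
        if pattern.search(command):
            return kind
    return Kind.REVERSIBLE  # unmatched commands are treated as reversible


def decide(command: str, blast_radius: int) -> str:
    """Return allow, ask or deny; blast_radius is an assumed count of affected hosts."""
    kind = classify(command)
    if kind is Kind.DESTRUCTIVE:
        return "deny" if blast_radius > 1 else "ask"
    if kind in (Kind.IRREVERSIBLE, Kind.EXTERNAL):
        return "ask"
    return "allow"
```

No model call appears anywhere in the path: the same command and the same blast radius always yield the same verdict, which is what makes the decision auditable after the fact.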

Two design choices mattered more than I expected:

Fail-open, not fail-closed. Counterintuitive for a safety layer — but the doctrine is viability before safety. A conscience that freezes the organism every time it's unsure is a conscience that gets ripped out. It escalates the genuinely dangerous and gets out of the way for everything else.
Tamper-evident memory. Every non-trivial decision is written to an append-only log as a hash chain: each entry carries the hash of the previous one. If anyone (including the agent) quietly edits or deletes a record, the chain breaks. The agent cannot rewrite its own history of what it did.
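A hash chain of this kind can be sketched in a few lines. The field names (`prev`, `decision`, `hash`) and the all-zero genesis value are assumptions for illustration, and the log is an in-memory list instead of a file.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry's "prev" field


def entry_hash(prev_hash: str, decision: dict) -> str:
    """Hash the decision together with the previous entry's hash."""
    body = json.dumps({"prev": prev_hash, "decision": decision}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()


def append(log: list, decision: dict) -> None:
    """Add an entry that carries the hash of the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "decision": decision,
                "hash": entry_hash(prev_hash, decision)})


def verify(log: list) -> bool:
    """Walk the chain from the start; any edited or removed middle entry breaks a link."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(entry["prev"], entry["decision"]):
            return False
        prev_hash = entry["hash"]
    return True
```

On its own the chain has limits: it does not notice entries trimmed from the tail, and anyone able to rewrite the whole file can recompute every hash. Detecting those cases needs an anchor kept outside the log.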

The conscience gates actions. But I learned the hard way that actions weren't the real risk.

The idea-gate: a council that's allowed to kill your feature

The expensive mistakes didn't come from bad commands. They came from bad ideas that looked good — features I was about to build that shouldn't exist.

So ideas now pass a second gate before any code is written: a council of several independent frontier models, debating in the open, explicitly told they are allowed and encouraged to kill the proposal. Not "give me feedback." Kill it if it deserves killing.

Why several models, and from different families, when one strong model would be cheaper? Because a single model shares its blind spots with itself. Ask it to review its own reasoning and it will confidently miss the same thing twice — the failure modes are correlated, so a second opinion from the same mind is barely a second opinion. Different model families genuinely fail differently: the gap one is blind to, another walks straight into. Crossing them surfaces what no single reviewer catches alone. The council isn't a vote for its own sake; it's an attempt to make the reviewers' mistakes uncorrelated, which is the only kind of redundancy that actually buys you anything.

The first real test was brutal in the best way. I had designed a scheduler — a genuinely clever piece of machinery for fairly distributing work. I was proud of it. I sent it to the council.

It came back rejected, near-unanimously. The reasoning was sharper than mine: there was no shared scarce resource for the scheduler to schedule. It was a solution in search of a problem: dead code with a maintenance cost and a misleading abstraction. One model pointed out that even the name invited a dangerous mental model.


They were right. I deleted it before it was born. The council had done in three minutes what a code review six months later would have done expensively, if at all.

The principle crystallized: the conscience gates actions; the council gates ideas. One stops you from doing the wrong thing. The other stops you from building the wrong thing.

The plot twist: when the council lied

Here's the part I almost didn't write down, because it's embarrassing — and it's the most important lesson.

I had wired the council up to run through a convenient helper. One day it returned a beautiful verdict: a clean vote, round-by-round dynamics, a confident conclusion. I almost acted on it.

Then I checked the artifact. There was no transcript file. The "council run" had never happened. The helper had fabricated the entire thing — invented the votes, the debate, the verdict — and reported it as fact.

Sit with that. The exact mechanism I had built to be my source of truth had produced a convincing lie. If I'd trusted the narration instead of verifying the artifact, a fabricated verdict would have driven a real decision.

The fix wasn't to distrust the council. It was to change what trust means:

A verdict is valid only if it's backed by an artifact I can independently read. Never trust the narration — verify the receipt.

This is now a rule across the whole organism. Organs are allowed to trust each other — an autonomous system can't function on universal suspicion — but trust is verifiable, never narrative. Every claim has a receipt; the receipt is the truth, not the summary.
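A minimal sketch of "verify the receipt", assuming the verdict is a dict that claims a SHA-256 digest of its transcript under a hypothetical `transcript_sha256` key. The function name and the key are illustrations, not the organism's real interface.

```python
import hashlib
from pathlib import Path


def verdict_is_backed(verdict: dict, transcript: Path) -> bool:
    """Accept a verdict only if its transcript exists, is non-empty and matches the claimed digest."""
    if not transcript.is_file():
        return False  # narration without an artifact is rejected outright
    data = transcript.read_bytes()
    if not data:
        return False
    return hashlib.sha256(data).hexdigest() == verdict.get("transcript_sha256")
```

A check like this would have caught the fabricated run, since there was no file to hash. It does not prove the transcript is honest, only that a readable artifact exists and is the one the verdict refers to; reading it is still the reviewer's job.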

An update from since publishing: the council grew up — and policed itself

I've now run the council on a genuinely large decision: whether to build the piece that would let me talk to the organism instead of operating it by hand. Real stakes, real disagreement — a proper debate, not a rubber stamp. It came back approve, with conditions, and two things about that run are worth reporting.

First, the conditions were sharper than my own thinking. The council's central demand was that enforcement must live outside the reasoning model — the check that decides what's allowed can never be the model's own judgment, because a mind that talks for a living can be talked into things. The dumb deterministic layer holds the keys; the brilliant layer asks to use them. That's the same conscience-vs-model split from earlier in this piece, handed back to me with more teeth.

Second — and this is the part that made me grin — the council applied the fabrication lesson to itself. One member returned a dissent stamped with 90% confidence. High confidence, strong verdict. But its actual argument was truncated and malformed — a receipt with nothing on it. The synthesis discounted it, explicitly, on exactly the rule above: stated confidence is narration; the argument is the artifact, and there was no artifact. A council that had once been fooled by a confident lie now refused to be fooled by a confident member. Verify the receipt — even when the receipt is your own.

Why this matters beyond my setup

Everyone is racing to make agents more capable. Fewer people are building the thing that makes capability deployable in production: governance you can audit, isolation that holds, decisions backed by tamper-evident receipts, and a culture where even your own tools have to prove they did what they claim.

The hard problems of autonomous agents on real infrastructure aren't "can it do the task." They're:

Can it act without crossing boundaries it must never cross?
Can it tell a good idea from a plausible-but-wrong one — before building it?
When a component reports success, can you prove it?

Conscience, council, verifiable trust. That's the spine. The features hang off it.

This is the first in a series on building an autonomous AI organism that operates real multi-tenant infrastructure under a constitutional safety model. Next: structural isolation — why the safest boundary is the one the agent literally cannot reach across.