DEV Community: Dariusz Newecki

My AI System Logged 35,669 LLM Calls. It Still Couldn’t Tell Me What They Cost.

Dariusz Newecki — Sat, 13 Jun 2026 09:45:57 +0000

CORE had telemetry.

That was the comforting part.

Every LLM exchange was being logged. Prompt tokens. Completion tokens. Duration. Cognitive role. Model snapshot. Timestamp. Privacy level. Enough information to reconstruct what the system had asked, which model had answered, and how the autonomous loop had used the result.

Then I asked the obvious question:

What did the last month of LLM work cost?

The database had no answer.

Not a bad answer. Not an approximate answer. No answer.

The cost_estimate column existed. It was even part of the log model. But across 35,669 recorded LLM calls, it was populated exactly zero times.

Every row was NULL.

That is the kind of bug that looks small until you understand what kind of system CORE is trying to become.

CORE is not just a wrapper around LLM calls. It is a governance runtime for AI-assisted software development. The point is not that an AI writes code. The point is that every AI-produced change must be traceable, authorized, constrained, audited, and defensible.

So when cost attribution was missing, this was not just a FinOps bug.

It was a governance blind spot.

The System Could Explain the Work, But Not the Bill

The strange thing was that most of the telemetry was already there.

CORE knew which cognitive role made the call.

It knew whether the call came from an architect, coder, reviewer, coherence analyst, or some other internal role.

It knew which model handled the request.

It knew the token counts.

It knew when the call happened.

That meant I could ask questions like:

Which cognitive roles are consuming the most tokens?
Which models are being used by which part of the system?
Which workflows are driving LLM activity?
How much autonomous reasoning happened during a given period?

But I could not ask:

Which cognitive role costs the most?
Did routing this role to a stronger model actually change the cost profile?
Did a model swap increase operational cost?
Is local inference replacing paid inference in the places where it should?
What did the last seven days of autonomous governance actually cost?

That matters because model routing is not just a technical preference.

In a governed system, model routing is an operational decision.

If I decide that one role should use a stronger model because it performs architectural judgment, while another role should use a cheaper model because it performs mechanical cleanup, that decision should be defensible.

Not with vibes.

Not with “it feels cheaper.”

With evidence.

CORE could show token volume. It could show model usage. It could show decision traces. But it could not show cost.

That meant the governor had incomplete information.

And in CORE, incomplete evidence is not a cosmetic issue.

Cost Is Part of the Decision Trace

This is where the bug became more interesting than the fix.

Most AI systems treat cost as billing metadata. Something you check in a provider dashboard. Something finance looks at later. Something external to the actual governance loop.

I do not think that is enough for autonomous systems.

Once a system starts making or proposing operational decisions, cost becomes part of the decision surface.

A governed AI system should be able to answer:

Why did you choose this model for this role?

A good answer might involve capability, reliability, privacy, latency, and cost.

But if cost is outside the system, the answer is already incomplete.

The system can say:

I used this model because it was assigned to this cognitive role.

That is not enough.

It should also be able to say:

This role consumed X tokens, cost Y over the last period, and produced Z accepted outcomes. The routing remains justified.

Or:

This role is now disproportionately expensive compared to its contribution. Reconsider the routing policy.

Or:

This model was swapped recently and cost increased faster than resolution quality improved.

Without cost telemetry, those questions move outside the system.

And once evidence moves outside the system, governance becomes manual again.

That is exactly the kind of silent drift CORE exists to prevent.

The Bug Was Embarrassingly Simple

The actual defect was not dramatic.

The writer existed. The log table existed. The column existed.

But the only write path set cost_estimate to None.

Every time.

That is the worst kind of governance bug: structurally prepared, semantically empty.

The system looked like it had cost attribution because the field existed. Queries could reference it. Reports could include it. The schema suggested accountability.

But the data was never written.

That is more dangerous than not having the field at all.

When a system lacks a field, the gap is visible.

When a system has a field that is always empty, the gap hides behind architecture.

And if you are building governance software, hidden gaps are the enemy.
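A gap like this is cheap to detect once you think to look for it. Here is a minimal sketch of a fill-ratio check, assuming a SQL-backed log. The table name llm_exchange_log and the helper column_fill_ratio are hypothetical, and SQLite stands in for whatever database actually holds the log.

```python
import sqlite3


def column_fill_ratio(conn, table, column):
    """Return (total_rows, non_null_rows) for one column.

    COUNT(*) counts every row, COUNT(column) skips NULLs.
    Identifiers cannot be bound as SQL parameters, so only pass
    trusted, hard-coded table and column names to this helper.
    """
    total, filled = conn.execute(
        f"SELECT COUNT(*), COUNT({column}) FROM {table}"
    ).fetchone()
    return total, filled


# Tiny in-memory stand-in for the real log table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llm_exchange_log ("
    "id INTEGER PRIMARY KEY, prompt_tokens INTEGER, cost_estimate REAL)"
)
conn.executemany(
    "INSERT INTO llm_exchange_log (prompt_tokens, cost_estimate) VALUES (?, ?)",
    [(120, None), (300, None), (45, None)],
)

total, filled = column_fill_ratio(conn, "llm_exchange_log", "cost_estimate")
if total > 0 and filled == 0:
    print(f"cost_estimate: 0 of {total} rows populated")
```

The check flags exactly the failure described here: a column that exists in the schema but has never been written.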

The Fix Was Boring. That Was the Point.

The fix was not to call an API dashboard.

The fix was to make cost part of CORE’s own evidence model.

That meant adding a rate source and computing cost at write time.

The important part was not just “multiply tokens by price.” The important part was preserving the evidence correctly.

A model’s price is not timeless. Pricing changes. Routing changes. Model names move. Providers revise their commercial terms. Local models may cost zero externally but still matter operationally.

So the fix needed a rate table with history.

The design became:

Store input and output rates separately.
Key rates to the model snapshot used at the time of the call.
Use an effective_from timestamp so historical rows can be priced against the rate that was valid when the call happened.
Compute cost_estimate when the exchange is logged.
If no rate exists, keep the cost NULL but log the missing-rate gap explicitly.
Preserve the existing fire-and-forget telemetry path so cost lookup failures do not break the system.

That last point matters.

Telemetry must not become a new point of fragility.

If an LLM call succeeds, but cost lookup fails because a rate was not configured, CORE should not crash the workflow. It should record the gap and keep moving.

The failure itself becomes evidence.

That is the governance pattern:

Do not pretend the system knows.
Record that it does not know.
Make the gap visible.
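The design above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not CORE's actual code: Rate, RateTable, estimate_cost, the logger name, the model name, and the prices are all invented for the example. Rates are quoted per million tokens here, which is an assumption as well.

```python
import logging
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal

logger = logging.getLogger("telemetry.cost")  # hypothetical logger name


@dataclass(frozen=True)
class Rate:
    effective_from: datetime
    input_per_mtok: Decimal   # price per million prompt tokens
    output_per_mtok: Decimal  # price per million completion tokens


class RateTable:
    """Rate history, keyed by the model snapshot used at call time."""

    def __init__(self):
        self._by_model = {}

    def add(self, model_snapshot, rate):
        rows = self._by_model.setdefault(model_snapshot, [])
        rows.append(rate)
        rows.sort(key=lambda r: r.effective_from)

    def rate_at(self, model_snapshot, when):
        """Latest rate whose effective_from is <= when, else None."""
        rows = self._by_model.get(model_snapshot, [])
        idx = bisect_right([r.effective_from for r in rows], when)
        return rows[idx - 1] if idx else None


def estimate_cost(table, model_snapshot, prompt_tokens, completion_tokens, called_at):
    """Return the cost as a Decimal, or None when it cannot be priced.

    Never raises: a missing rate or a lookup error is logged and the
    caller stores NULL, so telemetry cannot break the workflow.
    """
    try:
        rate = table.rate_at(model_snapshot, called_at)
        if rate is None:
            logger.warning(
                "missing rate: model=%s at=%s", model_snapshot, called_at.isoformat()
            )
            return None
        total = (prompt_tokens * rate.input_per_mtok
                 + completion_tokens * rate.output_per_mtok)
        return total / Decimal(1_000_000)
    except Exception:
        logger.exception("cost lookup failed, storing NULL")
        return None


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


table = RateTable()
table.add("example-model-2026-01", Rate(utc(2026, 1, 1), Decimal("3.00"), Decimal("15.00")))
table.add("example-model-2026-01", Rate(utc(2026, 6, 1), Decimal("2.00"), Decimal("10.00")))

old = estimate_cost(table, "example-model-2026-01", 1000, 500, utc(2026, 5, 15))  # 0.0105
new = estimate_cost(table, "example-model-2026-01", 1000, 500, utc(2026, 6, 2))   # 0.007
gap = estimate_cost(table, "unpriced-model", 1000, 500, utc(2026, 6, 2))          # None, gap logged
```

The same token counts price differently before and after the rate change, and the unpriced model yields None plus a log line instead of a silent zero. Decimal is used so that small per-call amounts do not pick up float rounding when summed over tens of thousands of rows.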

Why External Dashboards Are Not Enough

A reasonable objection is:

Why not just use the provider dashboard?

Because provider dashboards do not know CORE’s governance structure.

They may know the account. They may know the model. They may know aggregate usage. They may know invoice-level cost.

But they do not know CORE’s cognitive roles.

They do not know which call was architectural judgment and which call was mechanical formatting.

They do not know which proposal a call supported.

They do not know which finding the proposal resolved.

They do not know whether a call contributed to a successful remediation, a rejected proposal, a failed validation, or an architectural dead end.

CORE needs cost attribution inside its own consequence chain.

Not because provider dashboards are bad.

Because they answer a different question.

The provider answers:

What did this account spend?

CORE needs to answer:

What did this autonomous governance loop spend, why, under which role, toward which outcome, and under which authority?

Those are not the same question.

This Changed How I Think About Autonomy

Before this bug, I thought of autonomy mainly in terms of action.

Can the system find a violation?

Can it propose a fix?

Can it execute that fix through governed atomic actions?

Can it verify the result?

Can it stop itself when a rule is violated?

Those are still the right questions.

But they are not enough.

A system that acts autonomously also consumes resources autonomously.

That means resource use must be governed too.

Not eventually.

Not as a dashboard afterthought.

As part of the same trace.

Because every autonomous action has at least four dimensions:

What happened?
Why was it allowed?
What changed?
What did it cost?

If the system cannot answer all four, the audit trail is incomplete.

It may be technically impressive.

It may even be useful.

But it is not fully governable.

The Uncomfortable Part

The uncomfortable part is that I built the system to catch exactly this kind of thing.

CORE is supposed to expose drift between what the system claims and what it actually does.

And here it was: a schema claiming cost attribution, a runtime producing none.

That is not a failure of the idea.

That is the idea doing its job.

The point of CORE is not to never have governance gaps.

That would be fantasy.

The point is to make the gaps discoverable, nameable, fixable, and eventually enforceable.

This one started as a missing cost calculation.

It ended as a clearer rule:

If an autonomous system consumes LLM resources, cost attribution is part of governance evidence.

Not billing evidence.

Governance evidence.

The Real Lesson

The lesson is not “remember to calculate cost.”

That is too small.

The lesson is:

Autonomy without cost visibility is not operational autonomy.
It is automation with an unpriced control loop.

If a system can decide, propose, repair, retry, delegate, and call models thousands of times, then the governor must be able to see what that activity costs.

Per model.

Per role.

Per period.

Per outcome.

Otherwise the system is asking to be trusted.

And CORE’s entire position is that autonomous systems should not be trusted.

They should be governed.

So now CORE can start answering the question it should have answered from the beginning:

Not only what did the AI do?
Not only why was it allowed?
Not only what changed?
But also: what did it cost?

That is boring.

That is accounting.

That is governance.

And for autonomous AI systems, that is exactly the point.

CORE is open source here: github.com/DariuszNewecki/CORE

Autonomous AI should not just run. It should leave receipts.

When One Enum Is Secretly Two

Dariusz Newecki — Mon, 01 Jun 2026 20:05:10 +0000

I was one commit away from a bug that would never have thrown an error.

My system keeps every closed vocabulary in a single file — one source of truth for "here are the legal values for this field." One of those vocabularies described filesystem operations: read, create, modify, delete. Clean, small, obvious. Two different parts of the system were going to read it.

The first part is authorization. Every capability in the system declares a filesystem profile — what it's permitted to do. This worker may modify files but may not delete them. For that, the distinctions that matter live on the write axis: create, modify, and delete are three different permissions you might grant or withhold independently. Reading? Reading is just read. One bucket. The profile doesn't need to slice it finer.

The second part is audit. A taxonomy classifies every filesystem call the code makes, so a completeness check can prove that no category of access slips by unaccounted for. For that, the distinctions that matter live on the read axis: Path.read_text reads a file, Path.glob enumerates a directory, yaml.safe_load(path) parses a protected config off disk. Those are three different audit subjects. Writing? For the audit's purposes, writing collapses to a single write — because the policy it enforces is shaped like this namespace forbids the write class, full stop.

Look at the inversion. Authorization splits writes and collapses reads. Audit splits reads and collapses writes. Same domain — filesystem operations — sliced along perpendicular axes, because the two readers are answering different questions.

And they were about to share one enum.

The race that nearly buried it

The reason they were about to share it is almost funny. Two design decisions, written months apart, both declared they'd use "the filesystem operation vocabulary." A clause settled the overlap: whichever one gets implemented first creates the list; the second just uses what's already there. A materialization race — a data race, but for a vocabulary decision spread across two documents.

The authorization side shipped first and wrote [read, create, modify, delete]: the write axis. Which meant the audit side, when I finally got to it, would have inherited a vocabulary with no word for traverse and no word for parse. It would have had to lie in the only language it was given.

The tell

Here's the moment it stopped being a style preference and became undeniable.

Take Path.glob. Under the authorization vocabulary, the most honest label for it is read — it doesn't mutate anything. Under the audit vocabulary, read is flatly wrong; it's traverse, and that distinction is the entire point, because "this code enumerated a protected directory" is a different finding than "this code read a single file."

Same call. Two correct answers. The enum can only hold one.

That is the signature of one enum doing two jobs: a single concrete value that belongs in different buckets depending on who's asking. There is no naming fix for that. read isn't badly named. It's being asked to mean two things at once.

Why DRY was lying to me

The pull toward one enum was DRY, and DRY is usually right, which is exactly what makes this trap good. One vocabulary, one place, both consumers referencing it — that looks like the discipline you're supposed to practice. It feels like hygiene.

But DRY is about not duplicating knowledge, not about not duplicating shape. Two vocabularies that happen to overlap in spelling are not a repeated fact. They're two separate decisions that rhyme. Merging them doesn't remove duplication — it manufactures coupling, binding two things that change for different reasons. The authorization vocabulary changes when the permission model changes. The audit vocabulary changes when the set of call-classes you care about changes. Different forces, different cadence, different owners.

That's the Single Responsibility Principle, except aimed at a data type instead of a class: if two independent forces can each demand an edit, you're holding two things.

One spelling, one meaning

The rule I'd actually broken has a cleaner statement than any of this: one spelling, one meaning.

I had one spelling, fs_operation_class, quietly carrying two meanings. That is the mirror image of the defect everybody already polices: two spellings for one meaning, userId in one file and user_id in the next. We catch the synonym on sight; linters scream about it.

The homonym hides. One word, two meanings, nothing visibly duplicated. It doesn't look like a smell. It looks like economy.

How I spot a fused enum now

Any one of these is a yellow flag. Two of them is a decision:

Two subsystems both branch on the enum, for unrelated decisions.
You keep wanting to add a value "just for" one consumer that's meaningless to the other.
A single concrete value belongs in different buckets depending on which consumer is reading it.
The description has to say "for X this means…, for Y this means…"

That last one is the confession. The moment your docstring needs the word "for," you have two enums.

The fix, and the line

The fix wasn't clever. Two enums.

One keeps [read, create, modify, delete] for authorization. A new one carries [read, traverse, parse, write, neutral] for audit. Their overlap is exactly one value — read — and even that is a coincidence of spelling, not a shared decision: it's just the single operation that means the same thing under both questions. Each vocabulary is now free to move along its own axis without dragging the other behind it.
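As a rough sketch of what the split looks like in code: the class names FsPermission and FsAuditClass and the two lookup tables are hypothetical, chosen for this example. The values come from the lists above, and Path.unlink is added only to have one write call to classify.

```python
from enum import Enum


class FsPermission(Enum):
    """Authorization vocabulary: splits writes, collapses reads."""
    READ = "read"
    CREATE = "create"
    MODIFY = "modify"
    DELETE = "delete"


class FsAuditClass(Enum):
    """Audit vocabulary: splits reads, collapses writes."""
    READ = "read"
    TRAVERSE = "traverse"
    PARSE = "parse"
    WRITE = "write"
    NEUTRAL = "neutral"


# The same calls, classified once per question being asked.
REQUIRED_PERMISSION = {
    "Path.read_text": FsPermission.READ,
    "Path.glob": FsPermission.READ,
    "yaml.safe_load": FsPermission.READ,
    "Path.unlink": FsPermission.DELETE,
}

AUDIT_CLASS = {
    "Path.read_text": FsAuditClass.READ,
    "Path.glob": FsAuditClass.TRAVERSE,
    "yaml.safe_load": FsAuditClass.PARSE,
    "Path.unlink": FsAuditClass.WRITE,
}

for call in REQUIRED_PERMISSION:
    print(f"{call}: needs {REQUIRED_PERMISSION[call].value}, "
          f"audited as {AUDIT_CLASS[call].value}")
```

Members of different Enum classes never compare equal in Python, so FsPermission.READ and FsAuditClass.READ share a spelling but cannot be mixed up by accident.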

The sentence I wrote into the decision record, the one I'll reuse for the rest of my life:

When a unification claim doesn't survive the material differences between two surfaces, the unification was the bug — not either surface.

Why this one kept me up

A normal refactor earns a shrug. This one didn't, and here's why.

Nothing about the fused enum would have crashed. In a fail-closed system, that's the nightmare case: it doesn't fail closed, because it doesn't fail at all. It validates fine. It loads fine. It quietly hands one of its two readers a lossy answer, forever.

The authorization side would have been correct. The audit side would have cheerfully reported all reads accounted for — while folding traversals and parses into a read bucket it could no longer tell apart. A completeness check that's complete only because it went blind.
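To make the blindness concrete, here is a toy version of a completeness check run against both vocabularies. The call list, the two taxonomies, and the audit function are invented for illustration and are not CORE's real check.

```python
from collections import Counter

# Filesystem calls observed in some piece of code (made-up sample).
CALLS = ["Path.read_text", "Path.glob", "yaml.safe_load", "Path.glob"]

# Fused vocabulary: the audit can only say "read".
FUSED = {"Path.read_text": "read", "Path.glob": "read", "yaml.safe_load": "read"}

# Split vocabulary: the audit has its own words.
SPLIT = {"Path.read_text": "read", "Path.glob": "traverse", "yaml.safe_load": "parse"}


def audit(calls, taxonomy):
    """Return (complete, buckets): complete means every call was classified."""
    unclassified = [c for c in calls if c not in taxonomy]
    buckets = Counter(taxonomy[c] for c in calls if c in taxonomy)
    return len(unclassified) == 0, dict(buckets)


fused_ok, fused_buckets = audit(CALLS, FUSED)
split_ok, split_buckets = audit(CALLS, SPLIT)

print(fused_ok, fused_buckets)  # True {'read': 4}
print(split_ok, split_buckets)  # True {'read': 1, 'traverse': 2, 'parse': 1}
```

Both runs report complete. Only the second can still tell that two of the four calls enumerated a directory.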

The worst bugs in a governance system aren't the ones that throw. They're the ones that pass.

One enum, two meanings. Go check your docstrings for the word "for."

This is from CORE, an open-source constitutional governance runtime for AI-generated code. The decision above is ADR-080; the two enums live in the repo if you want to call my bluff: github.com/DariuszNewecki/CORE

The Most Powerful Developer in the Room Has Never Heard of SOLID

Dariusz Newecki — Tue, 19 May 2026 10:01:36 +0000

We have spent fifty years learning how to write software that doesn't collapse under its own weight.

Not because programmers were lazy. Because software is genuinely hard to reason about. Because complexity compounds. Because a decision made on day one creates debt that surfaces on day three hundred. Because the people who wrote the code are not always the people who maintain it.

So we built discipline.

The Accumulated Wisdom

SOLID. DRY. YAGNI. Design patterns. Architectural patterns. Separation of concerns. Dependency inversion. Test-driven development. Code review. Static analysis. Type systems. CI/CD. Linters. Convention over configuration.

None of these emerged from theory. They emerged from failures — real systems that broke, real teams that couldn't move, real codebases that became archaeologies nobody wanted to excavate. Each principle is a scar turned into a rule.

Over fifty years, the industry converged on a shared understanding: determinism in software is earned, not assumed. You earn it by making your intentions explicit, your dependencies visible, your contracts enforced, your changes traceable.

This is not a style preference. This is load-bearing knowledge.

The New Developer

Then came AI.

The most capable code producer ever built. It can implement a feature in seconds. It can refactor a module, write tests, explain its own output, generate documentation. The raw productivity is real.

It is also non-deterministic. It has no persistent architectural memory. It doesn't know what your team decided six months ago. It doesn't know which patterns you've banned and why. It doesn't know that PathResolver was excluded from the size rule for a documented reason, not by accident.

It produces code that looks correct. It violates the architecture underneath.

The most powerful developer in the room has never heard of SOLID. Of course the model can recite SOLID. That is not the point. It does not persistently enforce SOLID across a living codebase. It doesn't remember your last session. It has no idea what the codebase looked like before it touched it.

This isn't a complaint about AI capability. It's a structural observation about what AI is. Non-determinism and context blindness are not bugs to patch in the next model release. They are properties of the tool.

The Industry Reflex

The industry noticed the problem. The response was predictable.

AI makes mistakes → add AI reviewers. AI agents drift → add AI supervisors. AI generates inconsistent output → add AI validators. The ingredient that created the problem became the ingredient of the cure.

This is architecturally incoherent.

You cannot fix non-determinism with more non-determinism. You cannot make an unreliable system reliable by adding more unreliable components. Each layer of AI-on-AI increases the surface area of failure and makes the system harder to reason about, not easier.

The instinct is understandable. AI is the most powerful tool available, so more of it feels like more solution. But power and reliability are different properties. Stacking power doesn't produce reliability.

But What About Agents, Swarms, and Prompt Engineering?

Agents are orchestrated AI. Swarms are parallel AI. Prompt engineering is negotiated AI.

None of these change the underlying property: the output is non-deterministic and the enforcement is absent.

Adding coordination layers to non-deterministic components produces a more complex non-deterministic system, not a governed one. You don't get reliability — you get a larger blast radius when something goes wrong.

Prompt engineering is the most revealing tell. If your governance strategy is a better prompt, your governance lives inside the thing you're trying to govern. That's not governance. That's negotiating with an unreliable contractor and hoping they remember the rules next session.

A prompt is not a law. A law doesn't ask the model to comply.

The Right Abstraction Layer

The answer is not at the AI layer. It never was.

When C gave programmers power without safety, the industry didn't respond with "better C." It built type systems, memory safety, static analysis, formal verification. The answer was a deterministic layer above or around the dangerous tool.

When manufacturing processes introduced variability, the answer wasn't "better machines." It was fixtures, jigs, quality control, and documented standards. The machine operates inside a governed system.

The principle: when your most powerful component is also your least reliable, you don't replace it or double down. You wrap it in a deterministic system that makes its failures visible, traceable, and correctable.

This is what CORE does.

What CORE Actually Is

CORE is not an AI agent. It is a governed software factory.

The AI is one component — the code producer. It is never trusted. Its output is a proposal, not an execution. Before anything reaches the filesystem, it passes through a constitution: human-authored rules that encode the accumulated discipline — architectural standards, dependency contracts, naming conventions, structural invariants.

The constitution is law. The AI is a worker. The governor is a human who writes intent, not code.

Every action is traceable. Every violation is explicit. Where approval is required, execution cannot proceed until approval is recorded. The audit trail is queryable. The consequence chain — Finding → Proposal → Approval → Execution → File changes → New findings — is materialized as verifiable rows, not inferred from logs.

This is not a productivity tool dressed up in governance language. It is a feedback control system. The AI produces output. The governance layer evaluates it. Violations halt execution. The system converges toward a constitutionally compliant state — or it escalates to the human governor.
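In miniature, the shape of that loop might look like the sketch below. It is a deliberately small illustration, not CORE's engine: Proposal, the two rules, evaluate, and execute are hypothetical names, and the rules are plain Python functions where CORE reads them from .intent/.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """AI output is a proposal: a path plus the content it wants to write."""
    path: str
    new_content: str


def no_writes_to_intent(proposal):
    """Deterministic rule: autonomous workers never write under .intent/."""
    if proposal.path.startswith(".intent/"):
        return "writes under .intent/ are forbidden"
    return None


def max_lines(limit):
    """Rule factory: reject files longer than `limit` lines (example rule)."""
    def rule(proposal):
        n = proposal.new_content.count("\n") + 1
        if n > limit:
            return f"{proposal.path} has {n} lines, limit is {limit}"
        return None
    return rule


def evaluate(proposal, rules):
    """Run every rule and collect the violation messages."""
    return [msg for rule in rules if (msg := rule(proposal)) is not None]


def execute(proposal, rules, write):
    """Write only if no rule objects. Returns the violations (empty on success)."""
    violations = evaluate(proposal, rules)
    if violations:
        return violations  # halt: nothing reaches the filesystem
    write(proposal.path, proposal.new_content)
    return []


files = {}  # stand-in for the filesystem
rules = [no_writes_to_intent, max_lines(2)]

ok = execute(Proposal("src/a.py", "x = 1"), rules, files.__setitem__)
blocked = execute(Proposal(".intent/rules.yaml", "a\nb\nc"), rules, files.__setitem__)
```

The first proposal lands in files. The second returns two violations and writes nothing, and no model was asked to comply at any point.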

The Missing Piece

Here is what the "add more AI" approach abandons without naming it:

Fifty years of hard-won discipline about how to make software systems trustworthy.

Every SOLID principle. Every architectural decision record. Every code review convention. Every linting rule. Every naming standard. Every test coverage requirement. All of it evaporates the moment you let an AI write unsupervised into your codebase.

Not because the AI can't produce compliant code. Because there is nothing enforcing compliance. The knowledge exists. The enforcement is absent.

CORE is the bridge. The constitution is the fifty years of discipline made machine-enforceable. The governor doesn't write code — the governor holds the standards. The AI doesn't hold anything. It produces. The system judges.

The Question Worth Asking

The industry is asking: which AI do you trust?

That is the wrong question.

The right question is: have you built a system where trust doesn't need to be extended?

Where the AI can be wrong, and the system detects it. Where the AI can drift, and the audit catches it. Where the AI can hallucinate, and the constitution blocks it. Where a person who is not a programmer can build production-grade software by writing intent and governing AI — instead of trusting it.

That's not a vision. That's what CORE does today.

CORE is open source — github.com/DariuszNewecki/CORE. If you're building in the governed-AI or regulated-software space, comments are open.

Four Gates. One Governor. Zero Code Written. CORE Is Autonomous.

Dariusz Newecki — Wed, 13 May 2026 12:10:27 +0000

When I defined A3 fourteen weeks ago, I wrote: "The daemon runs continuously, the Blackboard clears, the codebase converges, and every action is visible." Today all four gates that operationalize that definition are closed. I want to be precise about what that means — and honest about where the evidence is still accumulating.

What A3 Actually Is

A3 is not a version number. It is a state the system either is in or isn't.

I defined it with four gates because "autonomous" is a claim that's easy to make and hard to prove. Each gate closes one dimension of the proof. You can't skip one and still make the claim honestly.

The four gates:

G1 — Loop closure. An autonomous fix lands end-to-end on a real example. Finding detected → proposal created → proposal approved → execution succeeded → re-audit confirms resolution. Not against a toy. Against the live codebase.

G2 — Convergence. Sustained state where the rate of finding resolution exceeds the rate of finding creation. This is what makes "autonomous" mean something rather than describing a system that runs forever without making progress.

G3 — Consequence chain. Every action is traceable. Finding → Proposal → Approval → Execution → File changes → New findings — all six edges materialized as queryable rows. The governor doesn't have to read source code to know what happened. The chain is the answer.

G4 — Governance in .intent/. No enforcement logic, path mappings, or policy thresholds live in src/. All of it lives in .intent/ — human-authored files, read-only to CORE at runtime, never written by autonomous workers. This gate is the reason the governor role is real rather than nominal.

All four are closed.

The One That Took Longest to Get Right

G3 closed first — May 1. G1 was proven during the 79-second self-heal I wrote about last week. G4 closed May 10, after a campaign that moved 32 operational config sections out of hardcoded src/ literals and into governed YAML, touching 113 files.

G2 was the last one, and the most careful.

The structural piece was a circuit-breaker. After N consecutive identical-signature proposal failures, the affected findings are marked DELEGATE and a hazard finding is posted to the Blackboard. What this does: it converts systematic errors — an LLM producing the same wrong output over and over, a rule with no valid automated fix — into governance signals rather than infinite churn. The system doesn't spin. It escalates.

That's the architecture of convergence. The daemon can't get stuck in a loop it can't exit. Every unmappable pattern eventually surfaces as a human decision.
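A minimal sketch of that circuit-breaker logic, under assumptions: the class name, the in-memory dictionaries, and the hazard message format are hypothetical, and the threshold N is a constructor argument here where the real system takes it from governance config.

```python
class ProposalCircuitBreaker:
    """Trip after N consecutive failures with the same error signature."""

    def __init__(self, threshold):
        self.threshold = threshold
        self._last_signature = {}  # finding_id -> signature of last failure
        self._streak = {}          # finding_id -> consecutive identical failures
        self.status = {}           # finding_id -> "DELEGATE" once tripped
        self.hazards = []          # stand-in for hazard findings on the Blackboard

    def record_failure(self, finding_id, signature):
        """Count one failed proposal. Returns True when the breaker trips."""
        if self._last_signature.get(finding_id) == signature:
            self._streak[finding_id] += 1
        else:
            # A different error is new information, so the streak restarts.
            self._last_signature[finding_id] = signature
            self._streak[finding_id] = 1

        tripped = (self._streak[finding_id] >= self.threshold
                   and self.status.get(finding_id) != "DELEGATE")
        if tripped:
            self.status[finding_id] = "DELEGATE"
            self.hazards.append(
                f"{finding_id}: {self._streak[finding_id]} identical failures "
                f"({signature}), delegated to governor"
            )
        return tripped

    def record_success(self, finding_id):
        """A successful proposal clears the failure history."""
        self._last_signature.pop(finding_id, None)
        self._streak.pop(finding_id, None)


breaker = ProposalCircuitBreaker(threshold=3)
results = [breaker.record_failure("finding-1", "gate:blocked") for _ in range(3)]
```

Only the third identical failure trips the breaker. Later failures on the same finding do not post a second hazard, so the escalation itself cannot turn into churn.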

I closed G2 on May 12. Band D — 107 issues, fourteen weeks of engine integrity work — closed the same day.

What the Audit Shows

Current state: core-admin code audit returns PASS, 20 findings.

Fourteen weeks ago, before Band D started, the audit returned findings in the hundreds across namespaces we didn't even have rules for yet. The findings weren't noise — they were governance debt we couldn't see because the instruments weren't built yet.

That's the counterintuitive thing about this kind of system. Adding a rule doesn't fix violations. It makes violations visible. When ADR-031 landed — no hardcoded runtime directory paths — it surfaced 40 pre-existing violations in one run. The audit went from PASS to FAIL. That FAIL was progress.

20 findings at PASS is not a clean codebase. It's a codebase where every remaining finding is known, tracked, and either queued for autonomous remediation or parked as a deliberate human decision. The difference between "has findings" and "has uncontrolled findings" is the entire value proposition.

The Governor Role, Fourteen Weeks In

I am not a programmer. I have not written implementation code during this project.

What I've done: defined constitutional rules, authored ADRs, reviewed proposals that required architectural judgment, held the line on decisions where the system wanted to go one way and the architecture required another. One example: when modularity.class_too_large kept triggering on PathResolver, the autonomous path wanted to split it. The architectural answer was an exclusion in governance config, with a documented removal condition. That decision belongs in .intent/. It takes three lines of YAML, not a code change.

The G4 gate is what makes this possible. When governance lives in src/, changing it requires a programmer. When it lives in .intent/, it requires a governor.

What "Done" Honestly Means

The machinery is complete. The empirical evidence is young.

G2's structural guarantee — the circuit-breaker — is real. What I don't yet have is weeks of daemon logs showing sustained convergence across diverse rule namespaces, under varied load, with a full autonomous approval cycle running. The gate is closed by architecture. The demonstration is still accumulating.

I'll write about that when the logs are there to show. The series has been honest about the distance between "designed to work" and "observed working." This is no different.

What's Next

The system is autonomous. The next question is whether it's legible — to someone who isn't its author.

That's Band E. The outward-facing work: making the consequence chain readable to a stranger, making the governor role demonstrable rather than described, making the case that a regulated-industry team could operate this without understanding the source code.

The 79-second self-heal was the internal proof. The external proof is what comes next.

CORE is a governed software factory, actively built by the method it describes — source on GitHub. If you're building in the governed-AI or regulated-software space and this resonates, comments are open.

79 Seconds: Our AI Governance System's First Autonomous Self-Heal

Dariusz Newecki — Sat, 09 May 2026 13:57:13 +0000

I am not a programmer. I wrote zero lines of code today. The system fixed itself.

We've been building CORE — a deterministic governance runtime that surrounds AI with constitutional law so that AI mistakes are detectable, traceable, and recoverable. The pitch is simple: a non-programmer governor holds the why, AI and workers handle the how, and the constitution ensures nothing unauthorized happens.

Today we proved it works. Not in a demo. Not against a toy example. Against a real system that had been stuck for four days.

The State of Things This Morning

The autonomous loop — detect violation → propose fix → approve → execute → verify — hadn't produced a successful commit in four days. The dashboard said last_consequence: 4d ago. The blackboard (our shared state surface) had 55 open findings, none of which the loop could act on. Proposals were being generated and immediately rejected as structurally incoherent.

From the outside it looked alive. Twenty active workers, sensors firing, heartbeats posting. But nothing was moving.

The Investigation

We didn't start by writing code. We started by asking questions of the system itself.

The first query revealed the shape of the problem: 150 failed proposals, 0 executed today, the last consequence four days old. Dig deeper: 128 of those 150 failures were the same error: a constitutional gate blocking the same action, over and over. That's not a bug in the traditional sense. That's the system correctly enforcing its own laws while an upstream generator keeps producing proposals that violate them.

Then: the 55 "open" findings the remediator was supposed to act on — what were they actually? Mostly blackboard.entry_stale meta-findings. The loop was trying to remediate its own observability noise. The actual code violations — 25 of them, confirmed by audit — were invisible, blocked by their own historical entries sitting in abandoned status, which the sensor dedup treated as permanent silencers.

Seven distinct root causes, nested. Each one blocking the diagnostic of the next.

What We Fixed

In order of discovery:

The stale-finding storm. The BlackboardShopManager was scanning all entry types for SLA violations — including heartbeats with a 10-minute SLA. Every daemon restart, thousands of old heartbeat entries immediately exceeded their SLA. One line added to the WHERE clause: AND entry_type IN ('finding', 'proposal'). Storm stopped. Zero new stale findings in 3 minutes versus 3 per minute before.
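
As a rough illustration of why that one predicate matters, here is a minimal sketch against an in-memory SQLite table. The table name, the columns, and the single 10-minute threshold are assumptions made for the example; only the entry_type IN ('finding', 'proposal') filter comes from the actual fix.

```python
import sqlite3

# Hypothetical schema: CORE's real blackboard table is not shown in this post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blackboard_entries ("
    "id INTEGER PRIMARY KEY, entry_type TEXT, age_seconds INTEGER)"
)
conn.executemany(
    "INSERT INTO blackboard_entries (entry_type, age_seconds) VALUES (?, ?)",
    [("heartbeat", 900), ("heartbeat", 7200), ("finding", 7200), ("proposal", 30)],
)

SLA_SECONDS = 600  # assumed 10-minute SLA applied to every row, for simplicity


def stale_entry_types(conn, restrict_types):
    """Return the entry types of rows that exceed the SLA, ordered by id."""
    sql = "SELECT entry_type FROM blackboard_entries WHERE age_seconds > ?"
    if restrict_types:
        # The one-line fix: only findings and proposals can go stale.
        sql += " AND entry_type IN ('finding', 'proposal')"
    sql += " ORDER BY id"
    return [row[0] for row in conn.execute(sql, (SLA_SECONDS,))]


before = stale_entry_types(conn, restrict_types=False)
after = stale_entry_types(conn, restrict_types=True)
print(before)  # ['heartbeat', 'heartbeat', 'finding']
print(after)   # ['finding']
```

Without the filter, every old heartbeat counts as an SLA violation the moment the scan runs. With it, only work items can.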

The consequence chain gap. When a proposal completed successfully, the findings it had addressed stayed in deferred_to_proposal status forever. The failure path had a revival method. The success path had nothing. New method: resolve_deferred_entries_for_completed_proposal(). Symmetric with the failure path. Twelve lines of code.
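
The symmetry is easier to see side by side. The sketch below is an assumption about shape, not CORE's code: findings are plain dicts, the proposal_id key is hypothetical, and I am assuming the success path ends in resolved while the failure path returns to open.

```python
def _transition_deferred(findings, proposal_id, new_status):
    """Move every finding deferred to this proposal into new_status."""
    changed = 0
    for finding in findings:
        if (
            finding["status"] == "deferred_to_proposal"
            and finding.get("proposal_id") == proposal_id
        ):
            finding["status"] = new_status
            changed += 1
    return changed


def revive_findings_for_failed_proposal(findings, proposal_id):
    """Failure path: put the work back in the queue."""
    return _transition_deferred(findings, proposal_id, "open")


def resolve_deferred_entries_for_completed_proposal(findings, proposal_id):
    """Success path: close the findings the proposal addressed."""
    return _transition_deferred(findings, proposal_id, "resolved")


findings = [
    {"id": 1, "status": "deferred_to_proposal", "proposal_id": "p-1"},
    {"id": 2, "status": "deferred_to_proposal", "proposal_id": "p-2"},
    {"id": 3, "status": "open", "proposal_id": None},
]
resolved_count = resolve_deferred_entries_for_completed_proposal(findings, "p-1")
revived_count = revive_findings_for_failed_proposal(findings, "p-2")
statuses = [f["status"] for f in findings]
print(statuses)  # ['resolved', 'open', 'open']
```

Either way, nothing is left sitting in deferred_to_proposal once its proposal has reached a terminal state.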

The proposal collapse. The proposal generator was creating proposals for N files but only including one action — always targeting scope.files[0]. A proposal claiming to fix 8 violations would touch exactly one file and leave 7 untouched. The fix: one ProposalAction per affected file, ordered 0 through N-1. The executor already supported multi-action proposals. Nobody had ever wired the generator correctly.
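
A minimal sketch of the difference, under stated assumptions: the ProposalAction fields and both builder functions are hypothetical stand-ins, since the real generator is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposalAction:
    # Hypothetical fields; the real ProposalAction model is not shown here.
    action_id: str
    file_path: str
    order: int


def build_actions_collapsed(action_id, files):
    """The bug: N files in scope, but a single action on files[0]."""
    return [ProposalAction(action_id, files[0], 0)]


def build_actions_per_file(action_id, files):
    """The fix: one action per affected file, ordered 0 through N-1."""
    return [ProposalAction(action_id, path, i) for i, path in enumerate(files)]


files = ["a.py", "b.py", "c.py"]
collapsed = build_actions_collapsed("fix.imports", files)
per_file = build_actions_per_file("fix.imports", files)
untouched = set(files) - {a.file_path for a in collapsed}
print(sorted(untouched))  # ['b.py', 'c.py']
```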

The DELEGATE routing gap. modularity.class_too_large violations — class-level refactors that require human judgment — were marked PENDING in the remediation map. PENDING entries are excluded from the active map by the loader. So those findings were claimed, found unmappable, and released back to open every 60 seconds. Forever. The fix was a YAML status change: PENDING → DELEGATE. The loader already handled DELEGATE entries. One word changed.

The permanent-silence bug. When we cleared the stale queue, we used abandoned status. What we didn't know: abandoned is treated the same as open by the sensor dedup logic. "Already represented on the blackboard, do not re-post." So the violations we'd cleaned up were now permanently invisible. Filed as a design-level issue — abandoned and "deliberately suppressed" need to be different states. Immediate fix: flip the cleaned-up audit.violation:: entries to resolved, which the sensor correctly treats as "re-detectable."
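
The dedup behaviour can be sketched in a few lines. This is an illustration of the semantics described above, not the sensor's actual code; the entry shape and the example subject are hypothetical.

```python
# Statuses that count as "already represented on the blackboard".
# Treating "abandoned" this way is what silenced re-detection.
SILENCING_STATUSES = {"open", "abandoned"}


def should_post(subject, blackboard):
    """Return True when the sensor is allowed to post a finding for subject."""
    return not any(
        entry["subject"] == subject and entry["status"] in SILENCING_STATUSES
        for entry in blackboard
    )


subject = "audit.violation::example"  # hypothetical subject
abandoned_board = [{"subject": subject, "status": "abandoned"}]
resolved_board = [{"subject": subject, "status": "resolved"}]

blocked = should_post(subject, abandoned_board)      # False: violation stays invisible
redetectable = should_post(subject, resolved_board)  # True: sensor may re-post
```

A separate "deliberately suppressed" state would let abandoned drop out of the silencing set without losing the ability to mute a finding on purpose.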

13:16:18

With the queue clean, the sensors unblocked, and the DELEGATE routing live, the loop had something to work with. A needs_split violation appeared. The remediator created a proposal. We approved it — the first manual approval of the day.

At 13:16:18, ProposalConsumerWorker picked it up. fix.modularity ran. The LLM took 33 seconds to analyze the file. It returned a plan.

The plan had one module. The validator requires at least two for a split.

mark_failed ran. The file changes were reverted. The proposal was marked failed.

Then: revive_findings_for_failed_proposal ran. The deferred finding flipped back to open.

At 13:17:37, 79 seconds after the worker first picked up the proposal, the finding had been re-claimed, a new proposal had been created, and it was sitting in the approval inbox.

The loop had self-healed. Without intervention. Traceable at every step.

What "Self-Heal" Actually Means

The LLM produced bad output. The system caught it, reverted the change, put the work back in the queue, and asked again. No data was corrupted. No state was left inconsistent. The governor's role was to review the next proposal and decide whether to approve it.

This is the regulated-industry argument for this kind of governance. You don't need AI to never fail. You need failure to be:

Detectable. The validator caught a 1-module "split" plan before anything was committed.
Bounded. The gate order — Conservation Gate, IntentGuard, plan validator — ensures AI output can't bypass constitutional constraints even if it tries.
Recoverable. The revival mechanism returned the system to a known-good state. The finding was exactly as it was before the failed attempt.
Traceable. Every step — finding posted, claimed, deferred, proposal created, approved, executing, failed, revived, re-claimed — is a timestamped row in a queryable table.

The audit trail isn't bolted on. It's how the loop works.
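
To make the four properties concrete, here is a minimal sketch of the failure path from 13:16:18. It is not CORE's code: the plan and finding dicts, the trail list, and the function names are hypothetical. The only rule taken from the story is that a split plan needs at least two modules.

```python
MIN_MODULES_FOR_SPLIT = 2


def validate_split_plan(plan):
    """Reject a 'split' that would not actually split anything."""
    modules = plan.get("modules", [])
    if len(modules) < MIN_MODULES_FOR_SPLIT:
        raise ValueError(
            f"split plan needs at least {MIN_MODULES_FOR_SPLIT} modules, "
            f"got {len(modules)}"
        )


def execute_split(plan, finding, trail):
    """Run one attempt; on failure, revive the finding and record every step."""
    trail.append("executing")
    try:
        validate_split_plan(plan)
    except ValueError as exc:
        trail.append(f"failed: {exc}")   # detectable
        finding["status"] = "open"       # recoverable: back in the queue
        trail.append("revived")          # traceable
        return False
    finding["status"] = "resolved"
    trail.append("completed")
    return True


finding = {"id": 42, "status": "deferred_to_proposal"}
trail = []
ok = execute_split({"modules": ["only_one"]}, finding, trail)
print(ok, finding["status"])  # False open
print(trail)
```

The bad plan never reaches a commit, and the trail reads as a plain sequence of states rather than something reconstructed afterwards.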

The Governor Role

I am not a programmer. I wrote zero lines of code today.

What I did: asked questions of the system, recognized when an answer pointed to a design gap rather than a bug, held the line on architectural decisions (backbone workers don't get split autonomously, regardless of what the violation detector says), and approved one proposal when the conditions were right.

The rest was diagnosis, sequencing, and constitutional reasoning. The code came from Claude Code on the development machine, prompted by the analysis. The analysis came from reading the system's own outputs — queries, logs, dashboard — not from reading source files.

That's the governor role. Not "I don't code therefore I'm not involved in technical work." The opposite: deeply involved in technical decisions, operating at the right level of abstraction, with a system that surfaces the right information to make those decisions.

The 79-second self-heal wasn't despite the governance architecture. It was because of it.

What's Next

The loop machinery is sound. The next bottleneck is fix.modularity's prompt — the LLM needs to be told explicitly to produce at least two modules and given responsibility-grouping context from the audit findings. That's prompt engineering work, not infrastructure.

When that's fixed, CORE will autonomously split files, verify the split, commit, re-audit, and confirm the finding is resolved — without a human writing a line of code.

We're close.

CORE is a governed software factory, actively being built by the method it describes — source on GitHub. If you're building in the governed-AI or regulated-software space and this resonates, comments are open.

CORE Closed Its Audit Trail. Then Found 18 Engine Gaps It Couldn't See Before.

Dariusz Newecki — Fri, 01 May 2026 21:35:48 +0000

Six weeks ago I published a post here titled "Your Agent Has Two Logs. One of Them Doesn't Exist Yet."

This week, Band B closed. CORE's second log exists.

Here's what that actually means — and why closing it immediately made things harder.

The two-log problem, briefly

Every autonomous system that touches production code has two logs whether it admits it or not.

Log one: what happened. Files changed, tests ran, commits landed.

Log two: why it happened. What finding triggered what proposal. What approval authorized what execution. What execution caused what file change. What file change produced what new finding.

Log two is the audit trail. In a regulated environment, log two isn't optional — it's the difference between a system you can defend and one you can't.

CORE had log one. Log two was missing.

What Band B actually required

Eight issues. Four ADRs. Seven coordinated write-path decisions — where in the code does attribution get written, in what shape, guaranteed by what gate.

The hard part wasn't the code. It was making the causality chain complete. Every link had to be present:

Finding → which proposal claimed it (and when)
Proposal → which execution consumed it (and what commit resulted)
Execution → which new findings it produced

Miss one link and the chain is decoration, not evidence.
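
One way to picture "complete" is a check that refuses to call a record evidence until every link is present. The field names below are hypothetical; CORE's actual attribution schema is defined in its ADRs, not here.

```python
# Hypothetical field names for one attribution record.
REQUIRED_LINKS = (
    "finding_id",
    "proposal_id",
    "claimed_at",
    "execution_id",
    "commit_sha",
    "new_finding_ids",
)


def missing_links(record):
    """List the links that were never written.

    An empty list of new findings is a valid answer; None means nobody
    recorded the answer at all.
    """
    return [key for key in REQUIRED_LINKS if record.get(key) is None]


complete = {
    "finding_id": "f-1",
    "proposal_id": "p-1",
    "claimed_at": "2026-04-30T10:00:00Z",
    "execution_id": "e-1",
    "commit_sha": "abc123",
    "new_finding_ids": [],
}
broken = dict(complete, commit_sha=None)

print(missing_links(complete))  # []
print(missing_links(broken))    # ['commit_sha']
```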

196 commits in April. 25 issues closed. Band B: 8 closed, 0 open.

What happened immediately after

Band D opened with 18 issues.

Not because we introduced regressions. Because closing Band B made the engine's integrity gaps visible in a way they weren't before. You can't measure attribution fidelity until attribution exists. Once it does, you can see exactly where the engine fails to populate it correctly.

This is the convergence principle working as designed. The system gets more capable. It immediately finds more problems with itself. The audit PASS holds — 19 active workers, findings are warnings about modularity, not governance failures. But the work queue doesn't shrink when a band closes. It shifts.

What "GxP-load-bearing" means in practice

I've been building CORE in part for environments like pharmaceutical manufacturing — where an AI system that modifies code or configuration needs to prove it acted within authorized boundaries, on authorized intent, with a complete audit trail.

GxP (Good Practice regulations) doesn't care what your system can do. It cares what your system can prove it did.

Band B is the difference between CORE being a capable tool and CORE being a defensible tool. The second log is what makes it defensible.

What's next

Band D: engine integrity. 18 open issues. The system that now has a complete audit trail needs its engine tightened before those traces are fully trustworthy.

Then Band E: external validation. CORE governing a repository it didn't build.

The second log exists. Now we make sure everything it records is true.

CORE is open source: github.com/DariuszNewecki/CORE

Previous in this series: Your Agent Has Two Logs. One of Them Doesn't Exist Yet.

My Audit Caught My Audit Being Wrong

Dariusz Newecki — Sat, 25 Apr 2026 22:16:52 +0000

And that's exactly what it's supposed to do.

A few days ago I ran a diagnostic on CORE — the governance system I'm building that supervises AI-generated code. The diagnostic was supposed to investigate why a specific audit rule appeared to be silently failing. Not firing. Producing zero findings against files it should have flagged.

I ran the investigation carefully. Stage by stage. I came to a conclusion.

The conclusion was wrong.

And I only found that out because the system itself told me so.

What I thought was happening

CORE has an audit rule called autonomy.tracing.mandatory. It checks that any class ending in Agent contains a mandatory call to self.tracer.record. The logic is straightforward: if an autonomous agent produces work, that work must be traceable. No tracing call — the rule flags it.
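
For readers who want to see what such a check amounts to, here is a stdlib-only approximation. It is not the ast_gate engine: the function name and the TracedAgent sample class are hypothetical, and the real rule may match calls differently.

```python
import ast


def classes_missing_call(source, suffix="Agent", required="self.tracer.record"):
    """Return (class name, line) for classes ending in suffix without the call."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name.endswith(suffix):
            # Collect the dotted name of everything called inside the class body.
            called = {
                ast.unparse(call.func)
                for call in ast.walk(node)
                if isinstance(call, ast.Call)
            }
            if required not in called:
                missing.append((node.name, node.lineno))
    return missing


SAMPLE = '''
class SelfHealingAgent:
    def run(self):
        return 1

class TracedAgent:
    def run(self):
        self.tracer.record("run")
'''

print(classes_missing_call(SAMPLE))  # [('SelfHealingAgent', 2)]
```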

My notes said the rule was firing zero findings against SelfHealingAgent — a class with, in fact, zero tracer references. A rule designed to catch exactly that situation, catching nothing.

That's a governance gap. If a rule exists and silently fails, you don't have an audit system. You have a theatrical one.

So I investigated.

What I actually found

The rule was firing. Correctly. Both findings were present, cleanly, in reports/audit_findings.json:

{
  "check_id": "autonomy.tracing.mandatory",
  "severity": "warning",
  "message": "Line 51: missing mandatory call(s): ['self.tracer.record']",
  "file_path": "src/will/agents/self_healing_agent.py"
}

The system wasn't broken. The diagnostic's starting assumption was broken.

Here's where it came from. CORE's audit output is rendered through Rich — a Python library that produces beautiful terminal tables with color, alignment, and spacing. Rich also truncates long strings to fit columns. So autonomy.tracing.mandatory becomes autonomy.tracing.mandat… on screen.

When I ran grep 'tracing.mandatory' against the captured terminal output to verify the finding, I got zero matches. Not because the finding wasn't there, but because Rich had silently replaced the last three characters of the rule name with an ellipsis, and my grep pattern was looking for the full string.

I used display output as an oracle. Display output lied.

The JSON source of truth never did.
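
The failure is easy to reproduce without Rich at all. The sketch below uses a hypothetical truncate_for_column helper as a stand-in for any display layer that shortens cells to fit a column; it is not Rich's algorithm, and the 24-character width is chosen only to reproduce the string I saw on screen.

```python
import json


def truncate_for_column(text, width):
    """Mimic a table cell that shortens text and appends an ellipsis."""
    return text if len(text) <= width else text[: width - 1] + "…"


# Canonical output: what the audit actually recorded.
report = json.dumps(
    [
        {
            "check_id": "autonomy.tracing.mandatory",
            "file_path": "src/will/agents/self_healing_agent.py",
        }
    ]
)

# Display output: what a narrow column shows.
rendered = truncate_for_column("autonomy.tracing.mandatory", 24)

pattern = "tracing.mandatory"
found_in_display = pattern in rendered
found_in_json = any(pattern in f["check_id"] for f in json.loads(report))

print(rendered)          # autonomy.tracing.mandat…
print(found_in_display)  # False
print(found_in_json)     # True
```

Same finding, same pattern, opposite answers. The only difference is which artifact was searched.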

The stage-by-stage result

I re-ran the diagnostic properly, going to primary sources instead of rendered output:

Stage	Status
Rule loaded and mapped	PASS — rule extracted, bound to `ast_gate` engine
Scope resolution	PASS — `self_healing_agent.py` in scope
Engine dispatch	PASS — engine ran against the file
Auto-ignore	PASS — zero suppressions, nothing dropped silently
Finding emitted	PASS — present in `audit_findings.json`

Every stage passed. The investigation had no failure to explain, because there was no failure. It was investigating a ghost.

Direct engine invocation confirmed it independently:

# Standalone check — no orchestrator involved
for node in ast.walk(tree):
    if GenericASTChecks.is_selected(node, selector):
        err = GenericASTChecks.validate_requirement(node, requirement)
        print(type(node).__name__, getattr(node, 'name', '?'), '->', err)

# Output:
# ClassDef SelfHealingAgent -> missing mandatory call(s): ['self.tracer.record']

Same verdict. No ambiguity.

Why this matters more than "I made a mistake"

I'm building a system where AI generates code and a deterministic governance layer audits it. The entire value proposition is that the governance layer is trustworthy. Not smart — trustworthy. You need to be able to look at a finding and know it reflects reality. You need to be able to look at a clean audit and know the system actually checked.

That's called instrument qualification. In regulated industries — pharmaceuticals, medical devices, aerospace — you don't just validate the product. You validate the instruments you used to measure the product. A thermometer that reads 37°C when the actual temperature is 39°C isn't a minor inconvenience. It's a systematic lie that compounds silently across every reading it ever produces.

I accidentally demonstrated the same principle in software.

When I used grep against Rich-rendered terminal output, I was reading from an instrument I hadn't qualified. Rich is a display library. It's not a data source. It's designed to make things readable to humans, not parseable by machines. Using it as a source of truth for a diagnostic is exactly as reliable as doing a medical measurement with a ruler.

The JSON report is the qualified instrument. It's the canonical output. It doesn't truncate. It doesn't wrap. It doesn't abbreviate for column fit. It says what the system found.

A passing audit with many findings is less honest than a failing audit with fewer real ones. An instrument that gives you clean-looking output that misrepresents reality isn't helping you — it's flattering you.

What I changed

Two things.

One: I added the stale references explicitly to the diagnostic record. My notes had two wrong module paths that would have caused anyone running the diagnostic in the future to hit ImportError immediately. AuditorContext is not in mind.logic.engines.ast_gate.base — it's in mind.governance.audit_context. I documented both as stale references, with the correct paths. Constitutional debt is honest debt. Hiding it helps no one.

Two: I documented the grep-against-Rich anti-pattern. Not as a personal failure, but as a category. If I did it, someone else will do it, or I'll do it again in six months under pressure. The pattern needs a name so it can be recognized.

The uncomfortable version

Here's the uncomfortable version of this story: I almost propagated the wrong conclusion.

If I'd stopped at "zero grep matches, rule is not firing," I would have written a finding that said the governance system had a blind spot. I might have gone looking for a fix in the wrong place. I might have introduced a workaround that solved a problem that didn't exist, while leaving a different problem — the unreliable diagnostic method — completely intact.

In a system that supervises autonomous AI code generation, a wrong finding about your audit rules is worse than a missing finding. A missing finding is a gap. A wrong finding is a confidence injection. You become more certain the system is broken in a specific way, and that certainty guides you away from the actual state.

That's the failure mode I'm most worried about in AI-supervised systems generally. Not that the AI is wrong — everyone accepts the AI might be wrong. The failure mode is when the verification layer produces plausible-looking output that you stop checking.

CORE is built on the assumption that every layer lies until verified. Including the diagnostic layer. Including me.

I'm not a programmer. I'm closer to a lawmaker than a coder. I built a governance system because I understand governance better than I understand AST traversal. Swimming against a current you can't even see clearly is exactly the situation where you need your instruments to be honest. Flattery is the thing that drowns you.

The system didn't flatter me. That's not a bug. That's the only thing I actually need it to do.

CORE is an open-source, deterministic governance runtime for AI-generated code. You can find it at github.com/DariuszNewecki/CORE.

The First Test CORE Ever Wrote For Itself

Dariusz Newecki — Sat, 18 Apr 2026 14:40:25 +0000

And why it was wrong — and why that's exactly the point.

Today, at 16:24 CET, my system wrote a test file for itself.

Not a test I wrote. Not a test a developer wrote. A test that CORE — my constitutional governance runtime — autonomously detected was missing, proposed to generate, waited for my approval, and then wrote using its own CoderAgent.

The test was wrong. The methods it tested don't exist. The API it assumed was hallucinated.

And I'm more excited about this than if it had been perfect.

What CORE is (briefly)

CORE is a deterministic governance runtime that surrounds AI code generation with constitutional law. AI produces code, but every output is verified against rules, audited, and must pass governance gates before execution. The human role is governor — not programmer.

I've written about this system before. The previous milestone was when CORE blocked itself — a rule violation preventing its own remediation from executing. Today's milestone is different. Today, the system grew a new autonomous capability.

Stream B: closing the test loop

CORE already has a working autonomous loop for code quality:

AuditViolationSensor detects violation
  → ViolationRemediatorWorker creates proposal
  → ProposalConsumerWorker executes fix
  → Sensor re-runs — finding resolves

Stream B was the same loop, but for test coverage:

TestCoverageSensor detects missing test
  → TestRunnerSensor confirms (pytest)
  → TestRemediatorWorker creates build.tests proposal
  → ProposalConsumerWorker executes → CoderAgent writes test
  → TestRunnerSensor re-runs — pass or fail finding posted

The components didn't exist. We built them today.

What we built

TestCoverageSensor — scans src/ for Python files with no corresponding test file. Posts test.run_required:: findings to the Blackboard. Critically: the scan parameters (source root, test root, excluded filenames) are read from .intent/enforcement/config/test_coverage.yaml at runtime. No paths hardcoded in Python. Changing what gets scanned is a constitution edit, not a code change.

TestRunnerSensor — already existed, just paused. Consumes test.run_required:: findings, runs pytest, posts test.missing or test.failure. Activated today.

TestRemediatorWorker — new acting worker. Claims test.missing and test.failure findings, groups by source_file, creates one build.tests proposal per file. Per-file deduplication: two concurrent proposals for different files are valid and don't block each other.

build.tests AtomicAction — already existed in the registry. Takes source_file, calls CoderAgent, runs auto-heal pipeline (fix.imports, fix.headers, fix.format), IntentGuard validation, writes the test file.

Four components. One closed loop.
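
The detection step at the front of that loop can be sketched with nothing but pathlib. This is an illustration, not the TestCoverageSensor itself: the function name, the config keys, and the src/pkg/mod.py to tests/pkg/test_mod.py naming convention are all assumptions, and a plain dict stands in for the YAML file under .intent/.

```python
import tempfile
from pathlib import Path


def find_untested(source_root, test_root, excluded):
    """Return source files (relative, POSIX-style) with no matching test file."""
    source_root, test_root = Path(source_root), Path(test_root)
    missing = []
    for src in sorted(source_root.rglob("*.py")):
        if src.name in excluded:
            continue
        rel = src.relative_to(source_root)
        # Assumed convention: src/pkg/mod.py -> tests/pkg/test_mod.py
        expected = test_root / rel.parent / f"test_{src.name}"
        if not expected.exists():
            missing.append(rel.as_posix())
    return missing


# Scan parameters live in configuration, not in the function body.
config = {"source_root": "src", "test_root": "tests", "excluded": ["__init__.py"]}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in (
        "src/pkg/__init__.py",
        "src/pkg/covered.py",
        "src/pkg/bare.py",
        "tests/pkg/test_covered.py",
    ):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("")
    untested = find_untested(
        root / config["source_root"], root / config["test_root"], config["excluded"]
    )

print(untested)  # ['pkg/bare.py']
```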

The bugs we hit

I'm going to be honest about the path here, because the bugs were instructive.

Bug 1: entry_id vs id.
The BlackboardService contract is clear: all finding dicts use the key "id". Somewhere along the way, three files in the codebase ended up with finding["entry_id"], confusing a local variable name with the dict key. Same fix three times: finding["id"]. The lesson: a contract stated only in docstrings is a contract that will be violated. CORE's next step should be schema-level enforcement.

Bug 2: Subject prefix mismatch.
ViolationRemediatorWorker only claims findings with prefix audit.violation::. test.missing:: findings sat on the Blackboard unclaimed — the remediation map had the right entries but the worker never saw them. Option A (widen prefix) was ruled out: the worker's core loop reads payload["rule"] for routing, and test findings have no rule key. Option C (dedicated worker) was the right call. TestRemediatorWorker was built. Single responsibility, clean separation.

Bug 3: action_executor not available in daemon context.
build.tests calls core_context.action_executor. At CLI bootstrap time, this attribute is monkey-patched onto CoreContext. The daemon doesn't do this — it passes a bare context. The fix was a hasattr guard, already canonically established in ViolationExecutorWorker with a comment explaining exactly this failure mode. Before applying it, I asked Claude Code to assess the blast radius: three sites in daemon paths were affected. We fixed the blocking one now; the other two go on the Phase 4 queue. Surgical over broad.

The first test

class TestBlackboardAuditor(unittest.TestCase):
    def test_audit_with_valid_data(self):
        mock_data = {
            "entries": [
                {"id": 1, "content": "Task 1", "status": "pending"},
            ]
        }
        result = self.auditor.audit(mock_data)
        self.assertIn("summary", result)

BlackboardAuditor has no audit() method. It has run(), run_loop(), SLA-tier checking, stale entry detection. The LLM invented an API from the class name alone.

Why am I not disappointed?

Because this is iteration zero. The infrastructure works — detection, proposal creation, approval gate, execution, git commit. The quality of the generated test is a separate concern, and it's an addressable one. CoderAgent generated tests without reading the source file first. The fix is to pass the source content as context before generation. That's a build_tests_action.py improvement for the next session.

More importantly: the system caught its own mistake. TestRunnerSensor will run, the tests will fail, test.failure findings will be posted, a repair proposal will be created. The loop continues.

What "autonomous" actually means here

I approved the proposal. I didn't write the test. I didn't write the sensor. I didn't wire the pipeline. I didn't debug the entry_id bug — I read the trace, stated the contract, Claude Code applied the fix.

My role today was:

Architectural decisions (Option A vs B vs C for the subject prefix problem)
Scope control (one file, not 741)
Approval gating (three proposals created, three reviewed, two rejected for cause, one approved)
Quality judgment (the test is wrong — that's useful signal, not a failure)

That is the governor role. Not programming. Governing.

The honest state

What works: The loop closes. Coverage gap detected → test proposed → human approves → test written → failure detected → repair proposed. End-to-end autonomous.

What doesn't yet: The generated tests are hallucinated. CoderAgent wrote tests for an API that doesn't exist because it had no context about what BlackboardAuditor actually does. The path mapping between src/ and tests/ is also hardcoded in two of the three pipeline files — a drift risk I'm aware of and haven't fixed yet.

What's next: The fix is the same pattern CORE already uses for code remediation: build a context package first. Read the source. Understand the architectural role. Then generate. ViolationRemediator calls RemediationInterpretationService.build_reasoning_brief_dict() before invoking any LLM — it passes actual method signatures, constitutional role, and import graph as the reasoning brief. build.tests skips this step entirely. The infrastructure exists. It just isn't wired yet. Fix that, fix the path mapping to read from .intent/ everywhere, then open the scope beyond one file.

The ratio today: one file with tests that fail. Tomorrow: the same loop repairs them.

On instrument qualification

I've written before about the GxP principle I apply to CORE: an instrument must be qualified before you trust its readings. An audit with 252 findings that passes is less trustworthy than one with 78 findings that fails.

Today's first test is wrong. But the instrument that detected "this file has no tests" is correct. The instrument that detected "this test fails" will also be correct.

The loop doesn't need perfect tests to be useful. It needs honest sensors.

CORE is open source. The architecture documents, constitutional rules, and implementation are all public at github.com/DariuszNewecki/CORE. Documentation at dariusznewecki.github.io/CORE.

Previous article in this series: The AI That Refused To Ship Its Own Fix

When My Governance System Governed Itself Wrong

Dariusz Newecki — Tue, 14 Apr 2026 20:08:04 +0000

I built a sensor to detect import order violations. It found 152. The fixer found 0. One of them was lying.

Background

CORE is a deterministic governance runtime I'm building around AI code generation. The core idea is simple: AI produces code, but AI is never trusted. Every output passes through constitutional rules, audit engines, and remediation loops before anything touches the codebase.

One of those loops works like this:

AuditViolationSensor detects violation
    → posts finding to Blackboard
ViolationRemediatorWorker claims finding
    → dispatches AtomicAction (fix.imports, fix.ids, fix.headers, etc.)
Sensor runs again
    → confirms violation gone or re-posts

This is the convergence loop. The goal is that the Blackboard empties over time as violations get fixed. That's what I call A3 — the daemon runs continuously and the codebase converges without me touching anything.

This session I was closing sensor coverage gaps. Several fix actions in dev sync had no corresponding sensor, meaning the daemon was blind to those violations and a human had to run dev sync manually to keep things clean. Not autonomous. Not A3.

One of the gaps was style.import_order. I wrote the sensor, wired it up, restarted the daemon.

152 findings.

The Problem

The sensor was using an AST-based implementation — check_import_order — that classifies imports into groups: future, stdlib, third_party, internal. It then checks that the groups appear in the right order.

The fixer uses ruff --select I, which does the same job but reads its configuration from pyproject.toml:

[tool.ruff.lint.isort]
known-first-party = ["api", "body", "cli", "features", "mind", "services", "shared", "will"]
section-order = ["future", "standard-library", "third-party", "first-party", "local-folder"]

I ran fix.imports --write to clean up before activating the sensor. Zero violations after. Then I activated the sensor. 152 violations.

The sensor and the fixer disagreed on what "correctly ordered imports" means.

Finding the Root Cause

I picked the simplest failing file — src/cli/resources/admin/patterns.py — violation at line 7:

import typer                              # third_party → idx 2
from shared.cli_utils import core_command # internal   → idx 3
from .hub import app                      # ???

The sensor's _classify_root function takes the module name and classifies it. For from .hub import app, a relative import, stmt.module is "hub". "hub" is not in stdlib_names and not in internal_roots, so it falls through to third_party — index 2.

But shared was classified as internal — index 3.

Index 2 after index 3 → violation.

Ruff treats relative imports as local-folder, which comes after first-party in the section order. So ruff considers this file clean. The sensor considers it broken.

Two problems:

Problem 1 — relative imports. The sensor had no concept of them. Any from .something import X got classified as third_party because the module name (something) didn't match any known root. Fix: detect stmt.level > 0 in ast.ImportFrom and classify as local with the highest order index.

Problem 2 — internal roots mismatch. The sensor hardcoded ["shared", "mind", "body", "will", "features"]. Ruff's known-first-party includes ["api", "body", "cli", "features", "mind", "services", "shared", "will"]. Missing: api, cli, services. When a file imports from cli after importing from body, ruff sees two first-party imports in any order — fine. The sensor sees third_party after internal — violation.

Fix: pass internal_roots as a parameter in the enforcement mapping so the sensor reads from configuration rather than hardcoding.

After both fixes: 0 violations. Sensor and fixer agreed.
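
Both problems can be reproduced in a small, self-contained sketch. It is an approximation of the idea, not CORE's check_import_order: the classify and first_violation helpers are hypothetical, the indices for future and stdlib are assumed, and only group order is checked, not ordering within a group.

```python
import ast
import sys

ORDER = {"future": 0, "stdlib": 1, "third_party": 2, "internal": 3, "local": 4}
# sys.stdlib_module_names exists on Python 3.10+; the fallback is a tiny stand-in.
STDLIB_NAMES = getattr(sys, "stdlib_module_names", frozenset({"os", "sys", "ast"}))


def classify(stmt, internal_roots):
    """Map one import statement to a group name."""
    if isinstance(stmt, ast.ImportFrom) and stmt.level > 0:
        return "local"  # relative import: never look at stmt.module
    name = stmt.module if isinstance(stmt, ast.ImportFrom) else stmt.names[0].name
    root = name.split(".")[0]
    if root == "__future__":
        return "future"
    if root in STDLIB_NAMES:
        return "stdlib"
    if root in internal_roots:
        return "internal"
    return "third_party"


def first_violation(source, internal_roots):
    """Return the line of the first import that appears after a later group."""
    highest = -1
    for stmt in ast.parse(source).body:
        if not isinstance(stmt, (ast.Import, ast.ImportFrom)):
            continue
        idx = ORDER[classify(stmt, internal_roots)]
        if idx < highest:
            return stmt.lineno
        highest = max(highest, idx)
    return None


ALL_ROOTS = {"api", "body", "cli", "features", "mind", "services", "shared", "will"}
OLD_ROOTS = {"shared", "mind", "body", "will", "features"}

relative_case = (
    "import typer\n"
    "from shared.cli_utils import core_command\n"
    "from .hub import app\n"
)
roots_case = "from body import x\nfrom cli import y\n"  # x and y are placeholders

print(first_violation(relative_case, ALL_ROOTS))  # None
print(first_violation(roots_case, OLD_ROOTS))     # 2
print(first_violation(roots_case, ALL_ROOTS))     # None
```

With the old hardcoded roots, cli falls through to third_party and lands after an internal import, which is exactly the false positive described above.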

The Architectural Lesson

This is an instrument qualification problem.

In GxP-regulated environments (pharma, medical devices), before you trust a measurement instrument, you qualify it. You verify that it measures what it claims to measure, using a known reference. An unqualified instrument is not a trusted instrument — even if it produces numbers.

I deployed a sensor without qualifying it against the fixer. The sensor was measuring something real (import order), but measuring it differently than the tool that fixes it. The result was 152 false positives — governance debt that looked real but wasn't.

A sensor that disagrees with its corresponding fixer is worse than no sensor. It creates noise, erodes trust in the Blackboard, and — if the remediator were running — would dispatch fix actions that produce no change, loop, and dispatch again.

The correct pattern before activating any new sensor:

Run the fixer in dry-run mode. Collect what it would change.
Run the sensor. Collect what it would flag.
Verify the two sets agree on the same files.
Only then activate.
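
The comparison step in that checklist is just set arithmetic. A minimal sketch, assuming both tools can be reduced to a list of file paths; the coherence_report name and the example paths are hypothetical, and this is not the governance.sensor_fixer_coherence rule itself.

```python
def coherence_report(fixer_would_change, sensor_flags):
    """Compare the files a fixer would touch with the files a sensor flags."""
    fixer, sensor = set(fixer_would_change), set(sensor_flags)
    return {
        "agree": fixer == sensor,
        "sensor_only": sorted(sensor - fixer),  # likely sensor false positives
        "fixer_only": sorted(fixer - sensor),   # violations the sensor cannot see
    }


# Hypothetical dry-run results for illustration.
report = coherence_report(
    fixer_would_change=["src/a.py"],
    sensor_flags=["src/a.py", "src/b.py"],
)
print(report)
# {'agree': False, 'sensor_only': ['src/b.py'], 'fixer_only': []}
```

Activation would be gated on agree being True. Anything in sensor_only is noise the remediator can never clear.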

CORE doesn't enforce this yet. The gap is now in the backlog as governance.sensor_fixer_coherence — a meta-rule that validates governance components against each other before they're trusted.

What Got Fixed

Three separate changes at three separate levels:

AST logic (src/mind/logic/engines/ast_gate/checks/import_checks.py):

# Before: relative imports fell through to third_party
# After: detect stmt.level > 0 and classify as local (idx=4)
if isinstance(stmt, ast.ImportFrom) and stmt.level > 0:
    grp = "local"
    idx = 4  # always last — after internal

Configuration (.intent/enforcement/mappings/code/style.yaml):

style.import_order:
  engine: ast_gate
  params:
    check_type: import_order
    internal_roots: ["api", "body", "cli", "features", "mind", "services", "shared", "will"]

Tooling — a new core-admin workers blackboard purge command to clear stale findings when a sensor produces false positives before a fix is applied.

Current State

7 sensors active. 52 rules. 0 findings. Blackboard clean.

The convergence loop is running. The daemon detects violations, the remediator dispatches fixes, the sensor confirms they're gone. That's A3.

The sensor-fixer coherence check doesn't exist yet. Until it does, every new sensor I add needs manual qualification before activation. That's a human step where CORE should eventually do the work itself.

Which is the point of the whole project.

CORE is open source: github.com/DariuszNewecki/CORE
Previous posts in this series cover the constitutional model, the autonomous loop, and the ViolationExecutor implementation.

PASSED with 252 findings. FAILED with 78. Which audit would you trust?

Dariusz Newecki — Tue, 07 Apr 2026 21:00:32 +0000

A story about instrument qualification, false positives, and why honest governance sometimes means failing on purpose.

The paradox

This morning, CORE's audit system reported 252 findings and returned a verdict of PASSED.

This evening, it reported 78 findings and returned a verdict of FAILED.

Nothing in production changed. No bugs were introduced. No architecture was violated.

The sensors were fixed.

Finding	Before	After	Delta
Total findings	252	78	-174
Orphan files	91	0	-91
Modularity (blunt score)	100	0	-100
needs_split	—	19	new
needs_refactor	—	27	new
File size (redundant rule)	29	0	-29
Verdict	PASS	FAIL	honest

The FAILED verdict is the correct one. The PASSED verdict was a compliance illusion.

The instrument qualification problem

In GxP-regulated environments — pharmaceutical manufacturing, medical devices, clinical software — you do not run an assay on an uncalibrated instrument and trust the result. Before any measurement is taken seriously, the instrument must be qualified: it must demonstrably measure what it claims to measure, within defined tolerances, under defined conditions.

This principle is so fundamental that it precedes any discussion of the data itself. Bad data from a qualified instrument is a finding. Bad data from an unqualified instrument is noise — and acting on noise has a name: it is a deviation.

Software governance systems face the same problem. An audit engine that produces findings is an instrument. If that instrument has not been qualified — if its detectors produce false positives, if its thresholds are miscalibrated, if its rules conflate distinct problem classes — then the findings it produces are not evidence. They are noise with a compliance label.

Acting on that noise with automated remediation is not governance. It is confident, expensive, wrong work.

Case 1: The orphan file detector

CORE uses a static import graph traversal to detect source files unreachable from any declared entry point. The principle is sound: if no entry point can reach a file, that file is dead code and should be removed.

The detector flagged 91 files as orphans.

All 91 were false positives. Each one is loaded dynamically at runtime, so no static import edge reaches it.

Static import graph traversal is a deliberate choice — deterministic, auditable, no runtime dependency. The tradeoff is that dynamically-loaded components must be explicitly declared as entry points. That declaration is itself a governance artifact: it makes the implicit loading contract explicit and versioned. The detector was not wrong — the contract was incomplete.

An automated agent pointed at those 91 findings would have deleted live production code. The agent would have been operating correctly within its mandate. The mandate was wrong.

The fix was not to make the detector smarter. It was to declare the dynamically-loaded directories as explicit entry points — converting an implicit runtime convention into a versioned, governed contract. Functionally this resembles static linking. Constitutionally it is different: the declaration is law, subject to change control, with documented rationale. The detector enforces the contract. The contract is owned by governance, not by the build system.

entry_points:
  - "src/will/self_healing/"
  - "src/will/test_generation/"
  - "src/shared/infrastructure/"
  # ... 10 more directories

After the fix: zero orphan findings. Zero code deleted. The codebase did not change. The instrument was qualified.
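A minimal sketch of the idea, not CORE's actual detector: reachability is a breadth-first walk over a static import graph, and every file under a declared entry-point prefix counts as a root. The function name `find_orphans`, the miniature graph, and the `src/cli/` prefix are hypothetical; only `src/will/self_healing/` comes from the declaration above.

```python
from collections import deque


def find_orphans(import_graph, entry_points):
    """Return files that no declared entry point can reach.

    import_graph maps each file to the files it statically imports.
    entry_points is a list of path prefixes (directories or files).
    """
    all_files = set(import_graph)
    for targets in import_graph.values():
        all_files.update(targets)

    # Every file under a declared entry-point prefix is a traversal root.
    roots = [f for f in all_files if any(f.startswith(p) for p in entry_points)]

    reachable = set(roots)
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        for imported in import_graph.get(current, ()):
            if imported not in reachable:
                reachable.add(imported)
                queue.append(imported)
    return sorted(all_files - reachable)


# Hypothetical graph: fixer.py is loaded dynamically, so nothing imports it.
graph = {
    "src/cli/main.py": ["src/shared/utils.py"],
    "src/shared/utils.py": [],
    "src/will/self_healing/fixer.py": ["src/shared/utils.py"],
}

before = find_orphans(graph, ["src/cli/"])
after = find_orphans(graph, ["src/cli/", "src/will/self_healing/"])
```

With only the CLI declared, `fixer.py` is reported as an orphan. Declaring its directory as an entry point clears the finding without touching the traversal logic, which is the same move as the fix described above.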

Case 2: The modularity score

Four rules were producing 100 findings collectively:

modularity.single_responsibility
modularity.semantic_cohesion
modularity.import_coupling
modularity.refactor_score_threshold

All four were proxies for a single composite score. All four mapped to the same remediation action: fix.modularity. All four carried the same enforcement level: reporting.

The problem is that they were measuring two fundamentally different things and treating them identically.

Problem class A: a file is too long with a single coherent responsibility.
This is a mechanical problem. The file does one thing but does too much of it. The solution is splitting — redistributing logic across smaller files along natural seams. No discipline boundaries are crossed. No architectural judgment is required. An automated system can propose and execute this split safely, subject to a Logic Conservation Gate that verifies no logic was lost.

Problem class B: a file mixes distinct architectural disciplines.
A file that combines CLI rendering, database access, and business logic in 300 lines is not a size problem. It is an architectural violation. Resolving it requires a human to decide where each responsibility belongs in the constitutional layer structure. An automated system cannot make that decision safely — not because AI is incapable of generating a proposal, but because the decision carries architectural authority that must remain with a human until the boundaries are formally established.

Conflating these two problems in a single score means the governance system cannot distinguish between what it is allowed to fix autonomously and what it must escalate. That distinction is not a technical nicety. In regulated environments, it is the difference between an approved automated action and an unauthorized architectural change.

The fix was to retire the four proxy rules and replace them with two precise sensors:

[
  {
    "id": "modularity.needs_split",
    "enforcement": "reporting",
    "rationale": "Automatable. Mechanical redistribution, no discipline boundaries crossed."
  },
  {
    "id": "modularity.needs_refactor",
    "enforcement": "blocking",
    "rationale": "Requires human judgment. Autonomous action prohibited until architectural decision is approved."
  }
]

The blocking enforcement on needs_refactor is the point. It is not a warning. It is a constitutional stop. The system will not proceed autonomously until a human has reviewed and authorized the architectural boundary decision.

This is why the audit now returns FAILED. Twenty-seven files contain mixed-discipline violations. They are real findings. They require real decisions. The system is correctly refusing to act without authorization.
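To make the split between the two problem classes concrete, here is a rough sketch of how a sensor could decide which finding a file gets. It is an illustration under assumptions, not CORE's implementation: `FileProfile`, `classify`, and the `MAX_LINES` threshold are hypothetical, and how a real sensor detects "disciplines" is left out entirely.

```python
from dataclasses import dataclass

# Hypothetical threshold; the real value is not stated in this post.
MAX_LINES = 400


@dataclass(frozen=True)
class FileProfile:
    path: str
    line_count: int
    disciplines: frozenset  # e.g. {"cli", "database", "business_logic"}


def classify(profile):
    """Return the finding id for a file, or None if it is clean."""
    if len(profile.disciplines) > 1:
        # Class B: mixed disciplines are a violation regardless of size.
        return "modularity.needs_refactor"
    if profile.line_count > MAX_LINES:
        # Class A: one coherent responsibility, but too much of it.
        return "modularity.needs_split"
    return None


mixed = FileProfile("a.py", 300, frozenset({"cli", "database", "business_logic"}))
long_file = FileProfile("b.py", 900, frozenset({"business_logic"}))
```

The discipline check runs first on purpose: a 300-line file that mixes three disciplines is never downgraded to a size problem.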

The verdict paradox

A governance system that always passes is not a governance system. It is a reporting system with a green checkbox.

PASSED with 252 findings meant: the system detected many things, none of them was classified as blocking, and therefore no action was required. The 91 false positives contributed to a picture of busyness without actionability. The composite modularity score produced findings that the automated remediator could not distinguish from each other. Everything was flagged, nothing was escalated.

FAILED with 78 findings means: the system has detected 27 architectural violations that require human decisions before any automated action proceeds. It has identified 19 files that can be split autonomously, subject to validation gates. Every finding in the report corresponds to a specific, actionable condition.

The failure verdict is evidence that the governance system is functioning correctly. It is not a regression. It is an honest measurement.
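The verdict logic described above can be sketched in a few lines. This assumes the simplest possible rule, that any blocking finding fails the audit; the real engine may weigh more than that. The `audit_verdict` function and the placeholder rule names `legacy` and `other` are hypothetical, and the 32 evening findings that the table does not break down are treated as reporting purely for illustration.

```python
def audit_verdict(findings):
    """Return FAILED if any finding carries blocking enforcement."""
    has_blocking = any(f["enforcement"] == "blocking" for f in findings)
    return "FAILED" if has_blocking else "PASSED"


# Morning: 252 findings, none classified as blocking.
morning = [{"rule": "legacy", "enforcement": "reporting"}] * 252

# Evening: 27 blocking, 19 reporting, plus 32 the table does not break down
# (assumed to be reporting here).
evening = (
    [{"rule": "modularity.needs_refactor", "enforcement": "blocking"}] * 27
    + [{"rule": "modularity.needs_split", "enforcement": "reporting"}] * 19
    + [{"rule": "other", "enforcement": "reporting"}] * 32
)
```

Under this rule the finding count never enters the verdict. Only the enforcement level does, which is why fewer findings can produce the stricter result.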

The principle

Governance quality is not measured by finding count. It is measured by finding accuracy.

In regulated environments, a false positive acted upon and a true positive ignored are not technical footnotes. Each is a compliance failure. Instrument qualification is not overhead. It is the precondition for trusting any measurement that follows.

Before you ask what your audit found, ask whether your audit can be trusted.

CORE is an open-source constitutional governance runtime for AI-assisted software development. Architecture, governance rules, and enforcement mappings are public.

github.com/DariuszNewecki/CORE

I Spent a Saturday Cleaning My Own Repo. CORE Made Me.

Dariusz Newecki — Sat, 04 Apr 2026 19:42:23 +0000

Not because I wanted to.

Because the system I built demands that everything it touches is defensible. And when I looked honestly at my own repository — the README, the docs, the .gitignore — they weren't.

So I fixed them.

The broken command nobody noticed

It started with a README.

The Quick Start section told anyone who cloned CORE to run:

poetry run core-admin check audit

That command doesn't exist. The correct command is:

poetry run core-admin code audit

One word difference. But anyone who followed that instruction would get an error on their very first interaction with the project. First impression: broken.

The CLI had evolved. The legacy verb-first pattern (check audit) was purged months ago when CORE's command structure was redesigned around resource-first architecture. The README hadn't kept up. It was documenting a command that no longer existed.

"If the docs lie, the system lies."

This is the thing about building a governance runtime: you can't enforce standards on AI-generated code while your own documentation ships broken commands.

CORE's entire thesis is:

Never produce software you cannot defend.

Not rhetorically. Technically, legally, epistemically, historically.

If I can't defend my own README — if the first thing someone tries doesn't work — then I'm not living by the standard I built into the system.

That's not a philosophical problem. It's a credibility problem. And a consistency problem. And those are exactly the problems CORE exists to solve.

What a Saturday of self-governance looks like

Here's what actually got done:

README:

Fixed the broken audit command (check → code)
Removed a stale metric (0 blocking violations) that may or may not have been current
Removed an acknowledgment that no longer reflected the project's direction
Replaced a buried, collapsible workflow diagram with a cleaner conceptual flow — visible immediately, no click required

CONTRIBUTING.md:

Updated the CI description (it had said "smoke testing" — it does more than that now)
Added the audit command so contributors know how to verify compliance locally before opening a PR

.gitignore:

Found that logs/* was missing — only !logs/.gitkeep existed, with no corresponding exclusion rule. Any non-.log file landing in logs/ would have been tracked silently.
Added proper logs/* and reports/* exclusions with the same pattern used for var/ and work/

docs/ — complete rewrite:

The docs site had 111 files across 30 directories, most of them written at various stages of development, not reflecting current architecture
I replaced all of it with six files: index.md, how-it-works.md, autonomy-ladder.md, getting-started.md, cli-reference.md, contributing.md
Every CLI command in the reference was verified against the actual source code — not inferred, not remembered, not guessed

That last point matters. The first draft of cli-reference.md was written by an AI assistant — from inference, not from source. I caught it, pushed back, and made it search the actual command registrations before writing anything. Same standard I apply to everything else.

The CLI reference problem is the whole problem in miniature

The first draft of cli-reference.md was written by an AI assistant — from inference, not from source.

It had wrong subcommands. Plausible ones, but wrong.

core-admin proposals inspect <id> — doesn't exist. It's show.

core-admin inspect status — legacy verb-first pattern, purged months ago. It's core-admin admin status.

core-admin governance coverage — wrong group entirely. It's core-admin constitution status.

Three wrong commands in one file. All confident. All wrong.

I caught it. Pushed back. Asked the assistant to search the actual source code before writing anything. It did. The commands got fixed.
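The check itself is mechanical, which is what makes skipping it hard to excuse. A small sketch, assuming the documented commands sit on lines that start with `core-admin` and that the set of registered commands has been extracted from the CLI source. Both helper functions are hypothetical, and the `registered` set here is typed by hand from the corrections above rather than read from CORE.

```python
def documented_commands(text):
    """Extract 'group verb' pairs from lines that start with core-admin."""
    commands = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "core-admin":
            commands.append(f"{parts[1]} {parts[2]}")
    return commands


def unverified_commands(text, registered):
    """Return documented commands that are not in the registered set."""
    return [cmd for cmd in documented_commands(text) if cmd not in registered]


# Hand-typed stand-in for what a scan of the command registrations would yield.
registered = {"code audit", "proposals show", "admin status", "constitution status"}

draft = """\
core-admin code audit
core-admin proposals inspect <id>
core-admin inspect status
core-admin governance coverage
"""

wrong = unverified_commands(draft, registered)
```

Run against the first draft, this flags exactly the three commands listed above and lets `code audit` through.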

The irony was not subtle: an AI assistant producing plausible but unverified output, in documentation for a system that exists specifically to prevent AI from producing plausible but unverified output.

That's not a documentation problem. That's an epistemic problem. And it's the same one that lives in .intent/northstar/core_northstar.md:

Nothing is assumed silently. All assumptions must be explicit, owned, and traceable. Reasoning requires citation. If CORE cannot point to evidence, it cannot act.

What this has to do with autonomy

CORE is currently at A2+ — governed generation, universal workflow pattern. I'm working toward A3 — strategic autonomy, where CORE identifies and proposes architectural improvements without being asked.

For A3 to be trustworthy, the system has to be clean. Not just the code — the whole project. The README someone reads before cloning. The docs they follow when getting started. The .gitignore that determines what gets committed.

If those are wrong, the foundation is wrong. And you can't build autonomous operation on a wrong foundation.

Cleaning the repo isn't glamorous. It doesn't advance the autonomy ladder. But it's the kind of work the system's own philosophy demands — and that I'd been quietly deferring.

The self-referential part

There's something almost uncomfortable about this.

I built a system that enforces: you cannot ship what you cannot defend. And then I had a README with a broken command, a .gitignore with a missing rule, and a documentation site with 111 files of outdated content.

The system couldn't enforce standards on its own repository — it doesn't govern Markdown files. That's a human responsibility.

Which means the human has to do it.

That's not a failure of CORE. That's the design. .intent/ is human-authored and immutable at runtime. CORE can never write to it. The constitution is mine to maintain.

The same is true for everything outside the autonomy lanes — the README, the docs, the project presentation. CORE governs the code. I govern the rest.

And today I did.

If you're curious

The repo is at github.com/DariuszNewecki/CORE.

If you've looked before and bounced — the docs are cleaner now. The commands in the Quick Start actually work.

If you're new: read .intent/ before the source. That's where the law lives.

Previous in this series: My AI Has 22 Workers, 2,470 Resolved Violations, and Still Can't Call Itself Autonomous. Here's the Gap.

The AI That Refused To Ship Its Own Fix

Dariusz Newecki — Wed, 01 Apr 2026 18:15:43 +0000

Or: what constitutional governance actually looks like in practice

I spent today doing something unusual. I was improving an AI system — and the system kept stopping me from making mistakes.

Not because it was clever. Because it was governed.

The Setup

CORE is a constitutional governance runtime for AI coding agents. The short version: instead of letting AI write code freely and checking quality afterwards, CORE requires every decision to be traceable to declared law before execution begins.

Every rule lives in .intent/. Every action requires authority. Every mutation is defensible or it doesn't happen.

Today's session was about advancing CORE toward A3 — full autonomous operation on Tier 1 violations. Eight items on the A3 readiness checklist were marked ❌. By end of session: one remained.

Here's what the path looked like.

The Gate That Wasn't Wired

Early in the session I wanted to promote two entries in the autonomous remediation map from PENDING to ACTIVE. The Logic Conservation Gate had been implemented — a protection against AI refactors that delete more than 50% of a file's logic. The comment in the YAML said "gated until Logic Conservation Gate implemented." Gate implemented. Promote.

I gave Claude Code the prompt.

It came back with a finding I hadn't expected:

"The gate exists in ComplexityRemediationService. ModularityRemediationService — the actual handler these entries call — never touches it. The comment in the YAML is still accurate."

The gate was implemented. Just not in the right place.

We wired it. Then we promoted.

That's not a dramatic story. That's exactly how it should work. The system forced verification before the claim. No amount of "the gate exists somewhere" was sufficient. It had to exist here, in this path, or the promotion was a lie.
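For readers who want a feel for what such a gate checks, here is a rough sketch. It is not CORE's Logic Conservation Gate: how CORE measures "logic" is not described in this post, so the sketch uses the count of AST statement nodes as a stand-in, and the names `logic_size` and `conserves_logic` are hypothetical. Only the 50% limit comes from the description above.

```python
import ast


def logic_size(source):
    """Count statement nodes as a rough proxy for the amount of logic."""
    return sum(isinstance(node, ast.stmt) for node in ast.walk(ast.parse(source)))


def conserves_logic(before, after, max_deleted=0.5):
    """Return False when more than max_deleted of the statements disappear."""
    original = logic_size(before)
    if original == 0:
        return True  # nothing to conserve
    deleted = (original - logic_size(after)) / original
    return deleted <= max_deleted


original_source = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
# A "refactor" that quietly drops the loop.
gutted_source = "def total(xs):\n    return 0\n"
```

The original has five statements and the gutted version has two, so 60% of the logic is gone and the gate rejects the change. A statement count is a blunt proxy, and a legitimate simplification can trip it too, which is one reason the gate has to sit in the exact path that performs the refactor.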

83 Silent Failures, Now Loud

Overnight, 83 proposals failed. Each showed execution_results: {} — empty. The handlers were running but returning nothing.

Three months ago this would have been invisible. The handlers returned ok=True unconditionally. Internal errors were swallowed. The proposal consumer would mark everything COMPLETED and move on.

Yesterday we fixed that. Wrapped every handler in try/except. Derived ok from actual outcomes instead of hardcoding success.

So this morning: 83 failures instead of 83 false completions.
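The shape of that fix, as a hedged sketch rather than the actual CORE code: wrap the handler call, turn exceptions into an explicit failure, and derive `ok` from what the handler returned. `HandlerResult` and `run_handler` are hypothetical names, and treating an empty result as a failure is my assumption based on the empty `execution_results` described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HandlerResult:
    ok: bool
    details: dict = field(default_factory=dict)
    error: Optional[str] = None


def run_handler(handler, proposal):
    """Run a handler and derive ok from the outcome instead of assuming it."""
    try:
        details = handler(proposal)
    except Exception as exc:
        # Internal errors become visible failures instead of being swallowed.
        return HandlerResult(ok=False, error=f"{type(exc).__name__}: {exc}")
    # Assumption: a handler that returns nothing did not do its job.
    return HandlerResult(ok=bool(details), details=details or {})


def broken_handler(proposal):
    raise ValueError("no target file")


def silent_handler(proposal):
    return {}


def working_handler(proposal):
    return {"files_changed": 1}
```

With this wrapper, `broken_handler` and `silent_handler` both come back with `ok=False`, so a consumer that marks proposals COMPLETED only on `ok=True` can no longer report false completions.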

That's progress. Honest failure is worth more than dishonest success. CORE's constitution says exactly this:

"CORE must never produce software it cannot defend."

A system that lies about its own outcomes cannot defend them.

319 Stuck Findings

The blackboard showed 319 entries in claimed status. All with claimed_by = NULL.

Legacy entries — claimed before we added atomic claiming with worker identity. The fix was one SQL statement. But finding it required reading the blackboard, querying claimed_by, and tracing the pattern.

No amount of assuming "the system is fine" would have found this. The evidence had to be read. The constitution demands it:

"Memory without evidence is forbidden."

After the fix, a new batch of 319 appeared — this time with a real UUID. The worker was claiming findings, finding no handler for them in the remediation map, and leaving them stuck.

Another fix: release unmappable findings immediately at claim time.
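Both fixes can be sketched against a toy blackboard using Python's built-in `sqlite3`. Everything here is an assumption for illustration: the `findings` table, its columns, the `open` and `claimed` status names, the contents of `REMEDIATION_MAP`, and both function names. The SQL shows one plausible shape for these fixes, not the statements CORE actually runs.

```python
import sqlite3

# Hypothetical map from rule id to remediation action.
REMEDIATION_MAP = {"modularity.needs_split": "fix.modularity"}


def release_legacy_claims(conn):
    """Reopen findings stuck in 'claimed' with no worker identity."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE findings SET status = 'open' "
            "WHERE status = 'claimed' AND claimed_by IS NULL"
        )
    return cur.rowcount


def claim_findings(conn, worker_id):
    """Claim open findings for a worker, then release the unmappable ones."""
    with conn:
        conn.execute(
            "UPDATE findings SET status = 'claimed', claimed_by = ? "
            "WHERE status = 'open' AND claimed_by IS NULL",
            (worker_id,),
        )
        rows = conn.execute(
            "SELECT id, rule FROM findings WHERE claimed_by = ? ORDER BY id",
            (worker_id,),
        ).fetchall()
        # Findings with no handler go straight back instead of staying stuck.
        unmappable = [(fid,) for fid, rule in rows if rule not in REMEDIATION_MAP]
        conn.executemany(
            "UPDATE findings SET status = 'open', claimed_by = NULL WHERE id = ?",
            unmappable,
        )
    return [fid for fid, rule in rows if rule in REMEDIATION_MAP]
```

Because the claim and the release happen in one transaction, no other worker ever sees a finding claimed by someone who cannot act on it.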

Each fix revealed by the system's own honesty about its state.

What Makes This Different

Most AI coding tools measure success by output volume. Lines written, tickets closed, PRs merged.

CORE measures success by defensibility. Can you explain why this change was made? Under what authority? With what evidence? What happens if it's wrong?

Today we made 14 commits. Each traceable to a checklist item. Each verified by the system before and after. The daemon either ran clean or it didn't. The blackboard either showed stuck entries or it didn't.

The AI didn't just write code. It was governed while writing code. And when the governance caught a mistake — the gate that wasn't wired, the handler that lied about success, the findings that stayed claimed forever — we fixed the governance, not just the symptom.

That's the mind shift. Not "AI writes code faster." But:

"Law governs intelligence. Defensibility outranks productivity."

Who This Is For

CORE is not for everyone. It's explicitly not for casual app builders or speed-only workflows.

It's for regulated environments. Safety-critical systems. Teams where "the AI decided" is not an acceptable answer in a post-mortem.

If that's your world — the architecture is open. The constitution is public.

🔗 github.com/DariuszNewecki/CORE

And if you think in terms of governance rather than just generation — I'm looking for collaborators. Not necessarily programmers. People who understand that software systems need to be able to explain themselves.

Written the same day the session happened. The daemon is running clean as I type this.